Preface



This volume consists of the 42 papers presented at the International Workshop 
on Energy Minimization Methods in Computer Vision and Pattern Recognition 
(EMMCVPR 2001), which was held at INRIA (Institut National de Recherche en
Informatique et en Automatique) in Sophia Antipolis, France, from September
3 through September 5, 2001. This workshop is the third in a series, which
started with EMMCVPR’97, held in Venice in May 1997, and continued with
EMMCVPR’99, which took place in York in July 1999.

Minimization problems and optimization methods permeate computer vision 
(CV), pattern recognition (PR), and many other fields of machine intelligence. 
The aim of the EMMCVPR workshops is to bring together people with research 
interests in this interdisciplinary topic. Although the subject is traditionally well 
represented at major international conferences on CV and PR, the EMMCVPR 
workshops provide a forum where researchers can report their recent work and 
engage in more informal discussions. 

We received 70 submissions from 23 countries, which were reviewed by the 
members of the program committee. Based on the reviews, 24 papers were
accepted for oral presentation and 18 for poster presentation. In this volume, no
distinction is made between papers that were presented orally or as posters.
The book is organized into five sections, whose topics coincide with the five
sessions of the workshop: “Probabilistic Models and Estimation”, “Image Modelling
and Synthesis”, “Clustering, Grouping, and Segmentation”, “Optimization and
Graphs”, and “Shapes, Curves, Surfaces, and Templates”.

In addition to the contributed presentations, EMMCVPR 2001 had the
privilege of including keynote talks by three distinguished scientists in the field:
Donald Geman, Geoffrey Hinton, and David Mumford. These invited speakers 
have played seminal roles in the development of modern computer vision and 
pattern recognition, and continue to be involved in cutting-edge research. 

We would like to thank a number of people who have helped us in making 
EMMCVPR 2001 a successful workshop. We thank Marcello Pelillo and
Edwin Hancock for allowing us to take care of the EMMCVPR series, which they
started, and for the important advice that made our organizational tasks easier. 
We also want to acknowledge all the program committee members for carefully 
reviewing papers for EMMCVPR. 

Finally, we thank the organizations that have provided support for
EMMCVPR: the International Association for Pattern Recognition, which
sponsored the workshop and provided publicity, and INRIA Sophia Antipolis, which
hosted the workshop and provided financial support.
A Double-Loop Algorithm
to Minimize the Bethe Free Energy 



Alan Yuille 

Smith Kettlewell Eye Research Institute, San Francisco CA 94965, USA 

yuille@ski.org



Abstract. Recent work (Yedidia, Freeman, Weiss [22]) has shown that 
stable points of belief propagation (BP) algorithms [12] for graphs with 
loops correspond to extrema of the Bethe free energy [3]. These BP 
algorithms have been used to obtain good solutions to problems for which 
alternative algorithms fail to work [4], [5], [10], [11]. In this paper we
introduce a discrete iterative algorithm which we prove is guaranteed 
to converge to a minimum of the Bethe free energy. We call this the 
double-loop algorithm because it contains an inner and an outer loop. 
The algorithm is developed by decomposing the free energy into a convex 
part and a concave part, see [25], and extends a class of mean field 
theory algorithms developed by [7], [8] and, in particular, [13]. Moreover, 
the double-loop algorithm is formally very similar to BP which may help 
understand when BP converges. In related work [24] we extend this work 
to the more general Kikuchi approximation [3] which includes the Bethe 
free energy as a special case. It is anticipated that these double-loop 
algorithms will be useful for solving optimization problems in computer 
vision and other applications. 



1 Introduction 

Local belief propagation (BP) algorithms [12] have long been known to converge
to the correct marginal probabilities for tree-like graphical models. Recently,
several researchers have empirically demonstrated that BP algorithms often perform
surprisingly well for graphs with loops [4], [5], [10], [11] and, in particular, for
practical applications such as learning low level vision [4] and turbo decoding
[10]. (See [11], however, for examples where BP algorithms fail). It is important
to understand when and why BP, and related algorithms, perform well on graphs
with loops.

More recently Yedidia, Freeman and Weiss [22] proved that the stable fixed
points of BP algorithms correspond to extrema of the Bethe free energy [3].
Therefore if a BP algorithm converges it must go to an extremum (i.e. maximum,
minimum, or saddle point) of the Bethe free energy (and empirically this
extremum seems always to be a minimum [22]). Yedidia et al describe how these
results can be generalized to Kikuchi approximations [3] which include the Bethe
free energy as a special case. Overall, Yedidia et al's work gives an exciting link
between belief propagation and inference algorithms based on statistical physics.



M.A.T. Figueiredo, J. Zerubia, A.K. Jain (Eds.): EMMCVPR 2001, LNCS 2134, pp. 3-18, 2001. 
(c) Springer-Verlag Berlin Heidelberg 2001
This paper develops the connections between BP and the Bethe free energy
(see [24] for extensions to Kikuchi free energy). The main result is a novel
discrete iterative algorithm, called the double-loop algorithm, which is guaranteed
to converge to a minimum of the Bethe free energy. The double-loop algorithm
is similar to BP because it also proceeds by passing “messages” between nodes
of the graph (and there are interesting formal similarities between the two
algorithms).

The double-loop algorithm is developed by decomposing the free energy into 
the sum of a convex term and a concave term. This is a general principle which 
can be applied to develop discrete iterative algorithms for a range of optimization 
problems, see [25] . In particular, it can be applied to mean field theory algorithms 
for solving optimization problems [7], [8], [13]. (Physicist readers should note that 
the “mean” is with respect to the Gibbs distribution but we do not, unlike most 
physics applications, assume that the “mean field” is spatially constant). 

In Section (6), we place this work in context of mean field theory approaches 
to optimization (see [7], [8], [13] and chapters by Peterson and Yuille in [2]). This 
material is essentially a review (and some readers might prefer to read it before 
the rest of the paper). 

Section (2) describes the Bethe free energy and BP algorithms. Section (3) 
gives the two basic design principles of our double-loop algorithm: (i) showing 
how to construct discrete iterative algorithms to minimize energy functions which 
are sums of a concave and a convex term, (ii) designing an iterative update 
algorithm guaranteed to enforce linear constraints. In Section (4) we apply these 
principles to the Bethe free energy and show that the specific nature of the 
constraints means that the iterative algorithm to solve the constraints takes 
a particularly simple form. Section (5) summarizes the double-loop algorithm 
and discusses formal similarities to BP. We conclude in Section (6) by briefly 
discussing how the scheme in this paper can be extended to obtain different 
statistical estimators, how temperature annealing can be done, and how the 
approach relates directly to mean field theory algorithms. 

2 The Bethe Free Energy and the BP Algorithms 

This section introduces the Bethe Free Energy and the BP algorithm following 
the formulation of Yedidia et al [22].

Consider a graph with nodes $i = 1, ..., N$. The state of a node is denoted by
$x_i$ (each $x_i$ has $M$ possible states). Each unobserved node is connected to an
observed node $y_i$. The joint probability function is given by:

$$P(x_1, ..., x_N | y) = \frac{1}{Z} \prod_{i,j: i>j} \psi_{ij}(x_i, x_j) \prod_i \psi_i(x_i, y_i), \tag{1}$$

where $\psi_i(x_i, y_i)$ is the local “evidence” for node $i$, $Z$ is a normalization constant,
and $\psi_{ij}(x_i, x_j)$ is the (symmetric) compatibility matrix between nodes $i$ and $j$.
We use the convention $i > j$ to avoid double counting. To simplify notation, we
write $\psi_i(x_i)$ as shorthand for $\psi_i(x_i, y_i)$. If nodes $i, j$ are not connected then we
set $\psi_{ij}(x_i, x_j) = 1$.

The goal is to determine estimates $\{b_i(x_i)\}$ of the marginal distributions
$\{P(x_i|y)\}$. It is convenient also to make estimates $\{b_{ij}(x_i, x_j)\}$ of the joint
distributions $\{P(x_i, x_j|y)\}$ of nodes $i, j$ which are connected in the graph. (Again
we use the convention $i > j$ to avoid double counting). The $\{b_i(x_i)\}$ can then be
used to calculate approximations to the minimum variance (MV) estimators for
the variables $\{x_i\}$, see Section (6.1).

The BP algorithm introduces variables $m_{ij}(x_j)$ which correspond to
“messages” that node $i$ sends to node $j$ (in the next subsection we will see how these
messages arise as Lagrange multipliers). The BP algorithm is given by:

$$m_{ij}(x_j; t+1) = c_{ij} \sum_{x_i} \psi_{ij}(x_i, x_j)\, \psi_i(x_i) \prod_{k \neq j} m_{ki}(x_i; t), \tag{2}$$

where $c_{ij}$ is a normalization constant (i.e. it is independent of $x_j$). We use the
convention that $m_{ki}(x_i; t) = 1$ if nodes $i, k$ are not connected. Nodes are not
connected to themselves (i.e. $m_{ii}(x_i; t) = 1$, $\forall i$).

The messages determine additional variables $b_i(x_i), b_{ij}(x_i, x_j)$ corresponding
to the approximate marginal probabilities at node $i$ and the approximate joint
probabilities at nodes $i, j$ (with convention $i > j$). These are given in terms of
the messages by:

$$b_i(x_i; t) = \hat{c}_i\, \psi_i(x_i) \prod_k m_{ki}(x_i; t), \tag{3}$$

$$b_{ij}(x_i, x_j; t) = \hat{c}_{ij}\, \phi_{ij}(x_i, x_j) \prod_{k \neq j} m_{ki}(x_i; t) \prod_{l \neq i} m_{lj}(x_j; t), \tag{4}$$

where $\phi_{ij}(x_i, x_j) = \psi_{ij}(x_i, x_j)\psi_i(x_i)\psi_j(x_j)$ and $\hat{c}_i, \hat{c}_{ij}$ are normalization
constants.

For a tree, the BP algorithm of equation (2) is guaranteed to converge and
the resulting $\{b_i(x_i)\}$ will correspond to the posterior marginals [12].
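
The updates of equations (2) and (3) can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the implementation used in the paper: the function names `belief_propagation` and `exact_marginals`, the dictionary layout of the compatibility matrices, the parallel message schedule, and the fixed iteration count are all choices made for this example. On a small tree the beliefs can be compared with marginals obtained by brute-force enumeration.

```python
import itertools
import numpy as np

def belief_propagation(psi_pair, psi_single, n_iters=20):
    """Parallel sum-product updates, equations (2) and (3).

    psi_pair[(i, j)] is an M x M array indexed [x_i, x_j] (hypothetical layout);
    psi_single is an N x M array of local evidence.
    """
    N, M = psi_single.shape
    nbrs = {i: [] for i in range(N)}
    for i, j in psi_pair:
        nbrs[i].append(j)
        nbrs[j].append(i)

    def compat(i, j):
        # Compatibility matrix indexed [x_i, x_j], whichever way the edge is stored.
        return psi_pair[(i, j)] if (i, j) in psi_pair else psi_pair[(j, i)].T

    msgs = {(i, j): np.ones(M) for i in range(N) for j in nbrs[i]}
    for _ in range(n_iters):
        new = {}
        for (i, j) in msgs:
            prod = psi_single[i].copy()
            for k in nbrs[i]:
                if k != j:
                    prod = prod * msgs[(k, i)]
            m = compat(i, j).T @ prod          # sum over x_i
            new[(i, j)] = m / m.sum()          # c_ij: normalization
        msgs = new
    beliefs = np.array([psi_single[i] * np.prod([msgs[(k, i)] for k in nbrs[i]], axis=0)
                        for i in range(N)])
    return beliefs / beliefs.sum(axis=1, keepdims=True)

def exact_marginals(psi_pair, psi_single):
    """Brute-force marginals of equation (1); only feasible for tiny graphs."""
    N, M = psi_single.shape
    marg = np.zeros((N, M))
    for x in itertools.product(range(M), repeat=N):
        p = np.prod([psi_single[i, x[i]] for i in range(N)])
        for (i, j), psi in psi_pair.items():
            p *= psi[x[i], x[j]]
        for i in range(N):
            marg[i, x[i]] += p
    return marg / marg.sum(axis=1, keepdims=True)
```

Messages are renormalized at every step, which only changes the constants $c_{ij}$ and leaves the beliefs unaffected.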

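The Bethe free energy of this system is written as [22]:

$$F_\beta(\{b_{ij}, b_i\}) = \sum_{i,j: i>j} \sum_{x_i, x_j} b_{ij}(x_i, x_j) \log \frac{b_{ij}(x_i, x_j)}{\phi_{ij}(x_i, x_j)} - \sum_i (n_i - 1) \sum_{x_i} b_i(x_i) \log \frac{b_i(x_i)}{\psi_i(x_i)}, \tag{5}$$

where $n_i$ is the number of neighbours of node $i$.

Because the $\{b_i\}$ and $\{b_{ij}\}$ correspond to marginal and joint probability
distributions they must satisfy linear consistency constraints:

$$\sum_{x_i, x_j} b_{ij}(x_i, x_j) = 1, \ \forall\, i, j : i > j, \qquad \sum_{x_i} b_i(x_i) = 1, \ \forall\, i,$$

$$\sum_{x_i} b_{ij}(x_i, x_j) = b_j(x_j), \ \forall\, j, x_j, \qquad \sum_{x_j} b_{ij}(x_i, x_j) = b_i(x_i), \ \forall\, i, x_i. \tag{6}$$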
The Bethe free energy consists of two terms. The first is of the form of a
Kullback-Leibler (K-L) divergence between $\{b_{ij}\}$ and $\{\phi_{ij}\}$ (but it is not actually
a K-L divergence because $\{\phi_{ij}\}$ is not a normalized distribution). The second is
minus the form of the Kullback-Leibler divergence between $\{b_i\}$ and $\{\psi_i\}$ (again
$\{\psi_i\}$ is not a normalized probability distribution). It follows that the first term is
convex in $\{b_{ij}\}$ and the second term is concave in $\{b_i\}$. This will be of importance
for the derivation of our algorithm in Section (3).
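
As a numerical illustration of equation (5), the sketch below evaluates the Bethe free energy for given beliefs. The function names, the dictionary layout, and the brute-force helper `exact_beliefs` are assumptions made for this example and are not part of the paper. For a graph without loops, evaluating (5) at the exact marginals should reproduce $-\log Z$, which gives a simple sanity check.

```python
import itertools
import numpy as np

def bethe_free_energy(b_pair, b_single, psi_pair, psi_single):
    """Evaluate equation (5); all beliefs are assumed strictly positive.

    b_pair[(i, j)] and psi_pair[(i, j)] are M x M arrays indexed [x_i, x_j].
    """
    N = psi_single.shape[0]
    n = np.zeros(N, dtype=int)                 # n_i: number of neighbours of node i
    F = 0.0
    for (i, j), bij in b_pair.items():
        n[i] += 1
        n[j] += 1
        phi = psi_pair[(i, j)] * psi_single[i][:, None] * psi_single[j][None, :]
        F += np.sum(bij * np.log(bij / phi))   # first (convex) term
    for i in range(N):
        # second (concave) term, weighted by n_i - 1
        F -= (n[i] - 1) * np.sum(b_single[i] * np.log(b_single[i] / psi_single[i]))
    return F

def exact_beliefs(psi_pair, psi_single):
    """Brute-force pairwise marginals, single marginals and Z (tiny graphs only)."""
    N, M = psi_single.shape
    b_single = np.zeros((N, M))
    b_pair = {e: np.zeros((M, M)) for e in psi_pair}
    Z = 0.0
    for x in itertools.product(range(M), repeat=N):
        p = np.prod([psi_single[i, x[i]] for i in range(N)])
        for (i, j), psi in psi_pair.items():
            p *= psi[x[i], x[j]]
        Z += p
        for i in range(N):
            b_single[i, x[i]] += p
        for (i, j) in psi_pair:
            b_pair[(i, j)][x[i], x[j]] += p
    return {e: v / Z for e, v in b_pair.items()}, b_single / Z, Z
```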

Yedidia et al [22] proved that the fixed points of the BP algorithm correspond 
to extrema of the Bethe free energy (with the linear constraints of equation (6)). 

Yedidia et al's work can be interpreted in terms of the dynamics of the dual
energy function [24] where we extremize the Bethe free energy to express the
$\{b_i\}, \{b_{ij}\}$ in terms of the Lagrange multipliers $\{\lambda_{ij}\}$. It can be shown that
the messages $\{m_{ij}\}$ can be expressed as simple functions of the Lagrange
parameters. In this formulation Yedidia et al's proof of convergence follows directly.

3 Convexity and Concavity 

We now develop an algorithm that is guaranteed to converge to a minimum of 
the Bethe free energy. This section describes how we can exploit the fact that 
the Bethe free energy is the sum of a convex and a concave term, see Section (2). 

Our main results are given by Theorems 1, 2 and 3 and show that we can obtain
discrete iterative algorithms to minimize energy functions which are the sum of a 
convex and a concave term. (A similar result can be obtained using the Legendre 
transform, see [25]). We first consider the case where there are no constraints 
for the optimization. Then we generalize to the case where linear constraints are 
present. 

Theorem 1. Consider an energy function $E(z)$ (bounded below) of form $E(z) =
E_{vex}(z) + E_{cave}(z)$ where $E_{vex}(z), E_{cave}(z)$ are convex and concave functions of
$z$ respectively. Then the discrete iterative algorithm $z^t \mapsto z^{t+1}$ given by:

$$\nabla E_{vex}(z^{t+1}) = -\nabla E_{cave}(z^t), \tag{7}$$

is guaranteed to monotonically decrease the energy $E(\cdot)$ as a function of time and
hence to converge to a minimum of $E(z)$.

Proof. The convexity and concavity of $E_{vex}(\cdot)$ and $E_{cave}(\cdot)$ means that:

$$E_{vex}(z_2) \geq E_{vex}(z_1) + (z_2 - z_1) \cdot \nabla E_{vex}(z_1),$$
$$E_{cave}(z_4) \leq E_{cave}(z_3) + (z_4 - z_3) \cdot \nabla E_{cave}(z_3), \tag{8}$$

for all $z_1, z_2, z_3, z_4$. Now set $z_1 = z^{t+1}, z_2 = z^t, z_3 = z^t, z_4 = z^{t+1}$. Using
equation (8) and the algorithm definition (i.e. $\nabla E_{vex}(z^{t+1}) = -\nabla E_{cave}(z^t)$)
we find that:

$$E_{vex}(z^{t+1}) + E_{cave}(z^{t+1}) \leq E_{vex}(z^t) + E_{cave}(z^t), \tag{9}$$



which proves the claim. 
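
A one-dimensional example makes the update of Theorem 1 concrete. The choice of energy below is an assumption made purely for illustration (it does not appear in the paper): $E(z) = z^4 - 2z^2$ with $E_{vex}(z) = z^4$ and $E_{cave}(z) = -2z^2$. Equation (7) then reads $4 (z^{t+1})^3 = 4 z^t$, so each step takes a cube root, and the recorded energies should never increase.

```python
import math

def cccp_quartic(z0, n_iters=60):
    """Iterate grad E_vex(z_new) = -grad E_cave(z_old) for E(z) = z**4 - 2*z**2.

    Returns the final iterate and the list of energies E(z^0), E(z^1), ...
    """
    energy = lambda z: z ** 4 - 2.0 * z ** 2
    z, trace = z0, [energy(z0)]
    for _ in range(n_iters):
        # E_vex = z**4 and E_cave = -2*z**2, so 4*z_new**3 = 4*z_old.
        z = math.copysign(abs(z) ** (1.0 / 3.0), z)
        trace.append(energy(z))
    return z, trace
```

Starting from a positive `z0` the iterates approach the minimum at $z = 1$; a negative start approaches $z = -1$ by symmetry.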
Theorem 1 generalizes previous results by Marcus, Waugh and Westervelt
[9], [19] on the convergence of discrete iterated neural networks. (They discussed 
the case where one function was convex and the second function was linear). 

We can extend this result to allow for linear constraints on the variables z. 
This can be given a geometrical intuition. Firstly, properties such as convexity
and concavity are preserved when linear constraints are imposed. Secondly,
because the constraints are linear they determine a hyperplane on which the 
constraints are satisfied. Theorem 1 can then be applied to the variables on this 
hyperplane. 

Theorem 2. Consider a function $E(z) = E_{vex}(z) + E_{cave}(z)$ subject to $k$ linear
constraints $\vec{\phi}^\mu \cdot z = c^\mu$ where $\{\vec{\phi}^\mu, c^\mu : \mu = 1, ..., k\}$ are constants. Then the algorithm
$z^t \mapsto z^{t+1}$ given by:

$$\nabla E_{vex}(z^{t+1}) = -\nabla E_{cave}(z^t) - \sum_{\mu=1}^{k} \alpha^\mu \vec{\phi}^\mu, \tag{10}$$

where the parameters $\{\alpha^\mu\}$ are chosen to ensure that $z^{t+1} \cdot \vec{\phi}^\mu = c^\mu$ for $\mu =
1, ..., k$, is guaranteed to monotonically decrease the energy $E(z^t)$ and hence to
converge to a minimum of $E(z)$.

Proof. Intuitively the update rule has to balance the gradients of $E_{vex}, E_{cave}$
in the unconstrained directions of $z$ and the term $\sum_\mu \alpha^\mu \vec{\phi}^\mu$ is required to deal
with the differences in the directions of the constraints. More formally, we define
orthogonal unit vectors $\{\vec{\psi}^\nu : \nu = 1, ..., n-k\}$ which span the space orthogonal to
the constraints $\{\vec{\phi}^\mu : \mu = 1, ..., k\}$. Let $y(z) = \sum_{\nu=1}^{n-k} \vec{\psi}^\nu (z \cdot \vec{\psi}^\nu)$ be the projection
of $z$ onto this space. Define functions $\hat{E}_{cave}(y), \hat{E}_{vex}(y)$ by:

$$\hat{E}_{cave}(y(z)) = E_{cave}(z), \qquad \hat{E}_{vex}(y(z)) = E_{vex}(z). \tag{11}$$

Then we can use the algorithm of Theorem 1 on the unconstrained variables
$y = (y_1, ..., y_{n-k})$. By definition of $y(z)$ we have $\partial z / \partial y_\nu = \vec{\psi}^\nu$. Therefore the
algorithm reduces to:

$$\vec{\psi}^\nu \cdot \nabla_z E_{vex}(z^{t+1}) = -\vec{\psi}^\nu \cdot \nabla_z E_{cave}(z^t), \quad \nu = 1, ..., n-k. \tag{12}$$

This gives the result (recalling that $\vec{\phi}^\mu \cdot \vec{\psi}^\nu = 0$ for all $\mu, \nu$).

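It follows from Theorem 2 that we only need to impose the constraints on
$E_{vex}(z^{t+1})$ and not on $E_{cave}(z^t)$. In other words, we set

$$\bar{E}_{vex}(z^{t+1}) = E_{vex}(z^{t+1}) + \sum_\mu \alpha^\mu \vec{\phi}^\mu \cdot z^{t+1}, \tag{13}$$

and use update equations:

$$\frac{\partial \bar{E}_{vex}}{\partial z}(z^{t+1}) = -\frac{\partial E_{cave}}{\partial z}(z^t), \tag{14}$$

where the coefficients $\{\alpha^\mu\}$ must be chosen to ensure that the constraints
$\vec{\phi}^\mu \cdot z^{t+1} = c^\mu$ are satisfied.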
We now restrict ourselves to the specific case where $E_{vex}(z) = \sum_i z_i \log \frac{z_i}{\zeta_i}$.
(This form of $E_{vex}(z)$ will arise when we apply Theorem 2 to the Bethe free
energy, see Section (4)). We let $h(z) = \nabla E_{cave}(z)$. Then the update rules of
Theorem 2, see equation (10), can be expressed as selecting $z^{t+1}$ to minimize a
convex cost function $E^{t+1}(z^{t+1})$ which, by duality, implies that the constraint
coefficients $\{\alpha^\mu\}$ can be found by maximizing a dual energy function. More
formally, we have the following result:

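Theorem 3. Let $E_{vex}(z) = \sum_i z_i \log \frac{z_i}{\zeta_i}$, then the update equation of Theorem 2
can be expressed as minimizing the convex energy function:

$$E^{t+1}(z^{t+1}) = z^{t+1} \cdot h + \sum_i z_i^{t+1} \log \frac{z_i^{t+1}}{\zeta_i} + \sum_\mu \alpha^\mu \{\vec{\phi}^\mu \cdot z^{t+1} - c^\mu\}, \tag{15}$$

where $h = \nabla E_{cave}(z^t)$. By duality the solution corresponds to

$$z_i^{t+1} = \zeta_i\, e^{-h_i} e^{-1} e^{-\sum_\mu \alpha^\mu \phi_i^\mu}, \tag{16}$$

where the Lagrange multipliers $\{\alpha^\mu\}$ are constrained to maximize the (concave)
dual energy:

$$\hat{E}^{t+1}(\alpha) = -\sum_i \zeta_i\, e^{-h_i} e^{-1} e^{-\sum_\mu \alpha^\mu \phi_i^\mu} - \sum_\mu \alpha^\mu c^\mu. \tag{17}$$

Moreover, maximizing $\hat{E}^{t+1}(\alpha)$ with respect to a specific $\alpha^\mu$ enables us to
satisfy the corresponding constraint exactly.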

Proof. This is given by straightforward calculations. Differentiating $E^{t+1}$ with
respect to $z_i^{t+1}$ gives:

$$1 + \log \frac{z_i^{t+1}}{\zeta_i} = -h_i - \sum_\mu \alpha^\mu \phi_i^\mu, \tag{18}$$

which corresponds to the update equation (10) of Theorem 2. Since $E^{t+1}(z)$ is
convex, duality ensures that the dual $\hat{E}^{t+1}(\alpha)$ is concave [18], and hence has
a unique maximum which corresponds to the constraints being solved. Setting
$\partial \hat{E}^{t+1} / \partial \alpha^\mu = 0$ ensures that $z^{t+1} \cdot \vec{\phi}^\mu = c^\mu$, and hence satisfies the $\mu$th constraint.

By using Theorem 3, we see that solving for the constraint coefficients in
Theorem 2 can be reduced to maximizing a concave energy function in the
$\{\alpha^\mu\}$.

Moreover, for the specific constraints used in the Bethe free energy we will
be able to determine an efficient discrete iterative algorithm for solving for the
constraints. This is done by generalizing work by Kosowsky and Yuille [7], [8] who
used a result similar to Theorem 3 to obtain an algorithm for solving the linear 
assignment problem. (Kosowsky and Yuille [7], [8] also showed that this result 
could be used to derive an energy function for the classic Sinkhorn algorithm 
[17] which converts positive matrices into doubly stochastic ones.) Rangarajan 
et al [13] applied this result to obtain double loop algorithms for a range of
optimization problems subject to linear constraints.
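
The connection to the Sinkhorn algorithm can be illustrated with a small sketch of Theorem 3. Assume, for this example only, that $z$ is a square matrix with entries $z_{ij} = K_{ij} e^{-r_i - c_j}$, where $K_{ij} = \zeta_{ij} e^{-h_{ij} - 1} > 0$ as in equation (16), and that the linear constraints require every row and every column of $z$ to sum to one. The names `sinkhorn_dual_ascent`, `K`, `r` and `c` are hypothetical. Solving each constraint exactly for its own multiplier, with the others fixed, is a coordinate-wise maximization of the dual energy (17), so the recorded dual values should not decrease.

```python
import numpy as np

def sinkhorn_dual_ascent(K, n_sweeps=200):
    """Coordinate ascent on the dual energy (17) for row/column-sum constraints.

    K[i, j] = zeta_ij * exp(-h_ij - 1) must be strictly positive and square.
    Returns the final matrix z and the dual energy after each sweep.
    """
    r = np.zeros(K.shape[0])          # multipliers for the row-sum constraints
    c = np.zeros(K.shape[1])          # multipliers for the column-sum constraints
    dual = []
    for _ in range(n_sweeps):
        # Solve every row constraint exactly for its multiplier (columns fixed) ...
        r = np.log(K @ np.exp(-c))
        # ... then every column constraint (rows fixed).
        c = np.log(K.T @ np.exp(-r))
        z = K * np.exp(-r[:, None] - c[None, :])
        # Equation (17) with all constraint values c^mu equal to 1.
        dual.append(-z.sum() - r.sum() - c.sum())
    return z, dual
```

Each half-sweep is the familiar Sinkhorn row or column normalization written in terms of the multipliers.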

In the next section we apply Theorems 2 and 3 to the Bethe free energy. In
particular, we will show that the nature of the linear constraints for the Bethe
free energy means that solving for the constraints in Theorem 3 can be done
efficiently.



4 Constraints for the Bethe Free Energy 



In this section we return to the Bethe free energy and describe how we can 
implement an algorithm of the form given by Theorems 1,2 and 3. This double- 
loop algorithm is designed by splitting the Bethe free energy into convex and 
concave parts. An inner loop is used to impose the linear constraints. (This 
design was influenced by the work of Rangarajan et al [13]). 

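First we split the Bethe free energy in two parts:

$$E_{vex} = \sum_{i,j: i>j} \sum_{x_i, x_j} b_{ij}(x_i, x_j) \log \frac{b_{ij}(x_i, x_j)}{\phi_{ij}(x_i, x_j)} + \sum_i \sum_{x_i} b_i(x_i) \log \frac{b_i(x_i)}{\psi_i(x_i)},$$

$$E_{cave} = -\sum_i n_i \sum_{x_i} b_i(x_i) \log \frac{b_i(x_i)}{\psi_i(x_i)}. \tag{19}$$

This split enables us to get non-zero derivatives of $E_{vex}$ with respect to both
$\{b_{ij}\}$ and $\{b_i\}$. (Other choices of split are possible).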

We now need to express the linear constraints, see equation (6), in the form
used in Theorems 2 and 3. To do this, we set $z = (b_{ij}(x_i, x_j), b_i(x_i))$ so that
the first $N(N-1)M^2/2$ components of $z$ correspond to the $\{b_{ij}\}$ and the later $NM$
components to the $\{b_i\}$ (recall that there are $N$ nodes of the graph each with
$M$ states). The dot product of $z$ with a vector $\vec{\phi} = (T_{ij}(x_i, x_j), U_i(x_i))$ is given
by $\sum_{i,j: i>j} \sum_{x_i, x_j} b_{ij}(x_i, x_j) T_{ij}(x_i, x_j) + \sum_i \sum_{x_i} b_i(x_i) U_i(x_i)$.

There are two types of constraints: (i) normalization, $\sum_{x_p, x_q} b_{pq}(x_p, x_q) =
1$ $\forall\, p, q : p > q$, and (ii) consistency, $\sum_{x_p} b_{pq}(x_p, x_q) = b_q(x_q)$ $\forall\, p, q, x_q : p > q$
and $\sum_{x_q} b_{pq}(x_p, x_q) = b_p(x_p)$ $\forall\, q, p, x_p : p > q$.

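We index the normalization constraints by $pq$ (with $p > q$). Then we can
express the constraint vectors $\{\vec{\phi}^{pq}\}$, the constraint coefficients $\{\alpha^{pq}\}$, and the
constraint values $\{c^{pq}\}$ by:

$$\vec{\phi}^{pq} = (\delta_{ip}\delta_{jq}, 0), \quad \alpha^{pq} = \gamma_{pq}, \quad c^{pq} = 1, \quad \forall\, p, q. \tag{20}$$

The consistency constraints are indexed by $pqx_q$ (with $p > q$) and $qpx_p$ (with
$p > q$). The constraint vectors, the constraint coefficients, and the constraint
values are given by:

$$\vec{\phi}^{pqx_q} = (\delta_{ip}\delta_{jq}\delta_{x_j, x_q}, -\delta_{iq}\delta_{x_i, x_q}), \quad \alpha^{pqx_q} = \lambda_{pq}(x_q), \quad c^{pqx_q} = 0, \quad \forall\, p, q, x_q,$$

$$\vec{\phi}^{qpx_p} = (\delta_{ip}\delta_{jq}\delta_{x_i, x_p}, -\delta_{ip}\delta_{x_i, x_p}), \quad \alpha^{qpx_p} = \lambda_{qp}(x_p), \quad c^{qpx_p} = 0, \quad \forall\, q, p, x_p. \tag{21}$$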
We now apply Theorem 2 to the Bethe free energy and obtain:

Theorem 4. The following update rule is guaranteed to reduce the Bethe free
energy provided the constraint coefficients $\{\gamma_{pq}\}, \{\lambda_{pq}\}, \{\lambda_{qp}\}$ can be chosen to
ensure that $\{b_{ij}(t+1)\}, \{b_i(t+1)\}$ satisfy the linear constraints of equation (6):

$$b_{ij}(x_i, x_j; t+1) = \phi_{ij}(x_i, x_j)\, e^{-1} e^{-\lambda_{ij}(x_j)} e^{-\lambda_{ji}(x_i)} e^{-\gamma_{ij}},$$

$$b_i(x_i; t+1) = \psi_i(x_i)\, e^{-1} e^{n_i} \left\{ \frac{b_i(x_i; t)}{\psi_i(x_i)} \right\}^{n_i} e^{\sum_k \lambda_{ki}(x_i)}. \tag{22}$$

Proof. Substitute equation (19) into equation (10) and use equations (20,21)
for the constraints. The constraint terms $\sum_\mu \alpha^\mu \vec{\phi}^\mu$ simplify owing to the form
of the constraints (e.g. $\sum_{p,q} \gamma_{pq} \delta_{ip} \delta_{jq} = \gamma_{ij}$).

Finally, we use Theorem 3 to obtain an algorithm which ensures that the
constraints are satisfied. Recall that finding the constraint coefficients is equivalent
to maximizing the dual energy $\hat{E}^{t+1}(\alpha)$ and that performing the maximization
with respect to the $\mu$th coefficient $\alpha^\mu$ corresponds to solving the $\mu$th constraint
equation. For the Bethe free energy it is possible to solve the $\mu$th constraint
equation to obtain an analytic expression for the $\mu$th coefficient in terms of the
remaining coefficients. Therefore we can maximize the dual energy $\hat{E}^{t+1}(\alpha)$ with
respect to any coefficient $\alpha^\mu$ analytically. Hence we have an algorithm that is
guaranteed to converge to the maximum of $\hat{E}^{t+1}(\alpha)$: select a constraint $\mu$, solve
the equation for $\alpha^\mu$ analytically, and repeat.

More formally: 

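Theorem 5. The constraint coefficients $\{\gamma_{pq}\}, \{\lambda_{pq}\}, \{\lambda_{qp}\}$ of Theorem 4 can be
solved for by a discrete iterative algorithm, indexed by $\tau$, guaranteed to converge
to the unique solution. At each step we select coefficients $\gamma_{pq}$, $\lambda_{pq}(x_q)$ or $\lambda_{qp}(x_p)$
and update them by:

$$e^{\gamma_{pq}(\tau+1)} = \sum_{x_p, x_q} \phi_{pq}(x_p, x_q)\, e^{-1} e^{-\lambda_{pq}(x_q; \tau)} e^{-\lambda_{qp}(x_p; \tau)},$$

$$e^{2\lambda_{pq}(x_q; \tau+1)} = \frac{\sum_{x_p} \phi_{pq}(x_p, x_q)\, e^{-\lambda_{qp}(x_p; \tau)} e^{-\gamma_{pq}(\tau)}}{\psi_q(x_q)\, e^{n_q} \left\{ \frac{b_q(x_q; t)}{\psi_q(x_q)} \right\}^{n_q} e^{\sum_{j \neq p} \lambda_{jq}(x_q; \tau)}},$$

$$e^{2\lambda_{qp}(x_p; \tau+1)} = \frac{\sum_{x_q} \phi_{pq}(x_p, x_q)\, e^{-\lambda_{pq}(x_q; \tau)} e^{-\gamma_{pq}(\tau)}}{\psi_p(x_p)\, e^{n_p} \left\{ \frac{b_p(x_p; t)}{\psi_p(x_p)} \right\}^{n_p} e^{\sum_{j \neq q} \lambda_{jp}(x_p; \tau)}}. \tag{23}$$

Proof. We use the update rule given by Theorem 4 and calculate the
constraint equations, $z \cdot \vec{\phi}^\mu = c^\mu$, $\forall\, \mu$. For the Bethe free energy we obtain
equations (23) where the upper equation corresponds to the normalization constraints
and the lower equations to the consistency constraints. We observe that we
can solve each equation analytically for the corresponding constraint coefficients
$\gamma_{pq}$, $\lambda_{pq}(x_q)$, $\lambda_{qp}(x_p)$. By Theorem 3, this is equivalent to maximizing the dual
energy, see equation (17), with respect to each coefficient. Since the dual energy is
concave, solving for each coefficient $\gamma_{pq}$, $\lambda_{pq}(x_q)$, $\lambda_{qp}(x_p)$ (with the others fixed) is
guaranteed to increase the dual energy. Hence we can maximize the dual energy,
and hence ensure the constraints are satisfied, by repeatedly selecting coefficients
and solving the equations (23).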

Observe that we can update all the coefficients $\{\gamma_{pq}\}$ simultaneously because
their update rule (i.e. the right hand side of the top equation of 23) depends only
on the $\{\lambda_{pq}\}$. Similarly, we can update many of the $\{\lambda_{pq}\}$ simultaneously because
their update rules (i.e. the right hand sides of the middle and bottom equation of
23) only depend on a subset of the $\{\lambda_{pq}\}$. (For example, when updating $\lambda_{pq}(x_q)$
we can simultaneously update any $\lambda_{ij}(x_j)$ provided $i \neq p \neq j$ and $i \neq q \neq j$.)

5 The Double-Loop Algorithm and BP 

In this section we summarize the double-loop algorithm and then show that it has 
some interesting similarities to the belief propagation algorithm as formulated 
by Yedidia et al [22]. 

The double-loop algorithm consists of an outer loop which implements equa- 
tion (22) plus an inner loop, given by equations (23) which imposes the con- 
straints. The double loop design of the algorithm was influenced by the work of 
Rangarajan et al [13]. 

A convergence proof and theoretical analysis of Rangarajan et al's algorithm
was given in [14]. It is not clear how many iterations of the internal loop are
needed to ensure the constraints are satisfied (though techniques developed by
[7], [8] and [14] might be used to bound the number of iterations). In practice,
the convergence of these inner loops seems to go quickly [7], [8], [13].

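The algorithm consists of an inner and an outer loop. The outer loop has
(discrete) time parameter $t$ and is given by:

$$b_{ij}(x_i, x_j; t+1) = \phi_{ij}(x_i, x_j)\, e^{-1} e^{-\lambda_{ij}(x_j)} e^{-\lambda_{ji}(x_i)} e^{-\gamma_{ij}}, \tag{24}$$

$$b_i(x_i; t+1) = \psi_i(x_i)\, e^{-1} e^{n_i} \left\{ \frac{b_i(x_i; t)}{\psi_i(x_i)} \right\}^{n_i} e^{\sum_k \lambda_{ki}(x_i)}, \tag{25}$$

and the inner loop to determine the $\{\gamma_{pq}\}, \{\lambda_{pq}\}, \{\lambda_{qp}\}$ has time parameter $\tau$
and is given by:

$$e^{\gamma_{pq}(\tau+1)} = \sum_{x_p, x_q} \phi_{pq}(x_p, x_q)\, e^{-1} e^{-\lambda_{pq}(x_q; \tau)} e^{-\lambda_{qp}(x_p; \tau)},$$

$$e^{2\lambda_{pq}(x_q; \tau+1)} = \frac{\sum_{x_p} \phi_{pq}(x_p, x_q)\, e^{-\lambda_{qp}(x_p; \tau)} e^{-\gamma_{pq}(\tau)}}{\psi_q(x_q)\, e^{n_q} \left\{ \frac{b_q(x_q; t)}{\psi_q(x_q)} \right\}^{n_q} e^{\sum_{j \neq p} \lambda_{jq}(x_q; \tau)}},$$

$$e^{2\lambda_{qp}(x_p; \tau+1)} = \frac{\sum_{x_q} \phi_{pq}(x_p, x_q)\, e^{-\lambda_{pq}(x_q; \tau)} e^{-\gamma_{pq}(\tau)}}{\psi_p(x_p)\, e^{n_p} \left\{ \frac{b_p(x_p; t)}{\psi_p(x_p)} \right\}^{n_p} e^{\sum_{j \neq q} \lambda_{jp}(x_p; \tau)}}. \tag{26}$$

Recall that the inner loop is required to impose linear constraints on the
$\{b_{ij}\}, \{b_i\}$. For example, the update rule for $\lambda_{pq}(x_q)$ imposes the constraint
$\sum_{x_p} b_{pq}(x_p, x_q) = b_q(x_q)$.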
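
The two loops can be put together in a short Python sketch. It is written under several assumptions that are not taken from the paper: the graph is connected, all potentials are strictly positive, the inner loop is run for a fixed number of sequential sweeps (with the multipliers carried over between outer iterations) rather than to a convergence tolerance, and the names `double_loop`, `node_factor` and `exact_marginals` are hypothetical. Only the marginals $\{b_i\}$ are tracked, since the $\{b_{ij}\}$ can be recovered from the multipliers afterwards. On a small tree the result can be compared with brute-force marginals.

```python
import itertools
import numpy as np

def double_loop(psi, psi_pair, n_outer=150, n_inner=50):
    """psi: (N, M) node evidence; psi_pair[(p, q)], p > q: (M, M) array indexed [x_p, x_q]."""
    N, M = psi.shape
    phi = {e: w * psi[e[0]][:, None] * psi[e[1]][None, :] for e, w in psi_pair.items()}
    nbrs = {i: [] for i in range(N)}
    for p, q in psi_pair:
        nbrs[p].append(q)
        nbrs[q].append(p)
    n = np.array([len(nbrs[i]) for i in range(N)])
    b = np.full((N, M), 1.0 / M)                                     # initial b_i
    lam = {(k, i): np.zeros(M) for i in range(N) for k in nbrs[i]}   # lam[(k, i)] depends on x_i
    gam = {e: 0.0 for e in psi_pair}

    def node_factor(i, skip=None):
        # psi_i e^{n_i} (b_i / psi_i)^{n_i} e^{sum_{k != skip} lam_ki(x_i)}
        s = sum((lam[(k, i)] for k in nbrs[i] if k != skip), np.zeros(M))
        return psi[i] * np.exp(n[i]) * (b[i] / psi[i]) ** n[i] * np.exp(s)

    for _ in range(n_outer):
        for _ in range(n_inner):                                     # inner loop
            for (p, q), f in phi.items():
                gam[(p, q)] = np.log(np.sum(
                    f * np.exp(-1.0 - lam[(q, p)][:, None] - lam[(p, q)][None, :])))
                num = np.sum(f * np.exp(-lam[(q, p)])[:, None], axis=0) * np.exp(-gam[(p, q)])
                lam[(p, q)] = 0.5 * np.log(num / node_factor(q, skip=p))
                num = np.sum(f * np.exp(-lam[(p, q)])[None, :], axis=1) * np.exp(-gam[(p, q)])
                lam[(q, p)] = 0.5 * np.log(num / node_factor(p, skip=q))
        # outer loop: update every b_i from the old b_i and the current multipliers
        b = np.array([np.exp(-1.0) * node_factor(i) for i in range(N)])
    return b

def exact_marginals(psi, psi_pair):
    """Brute-force marginals of the joint distribution (tiny graphs only)."""
    N, M = psi.shape
    marg = np.zeros((N, M))
    for x in itertools.product(range(M), repeat=N):
        w = np.prod([psi[i, x[i]] for i in range(N)])
        for (p, q), m in psi_pair.items():
            w *= m[x[p], x[q]]
        for i in range(N):
            marg[i, x[i]] += w
    return marg / marg.sum(axis=1, keepdims=True)
```

The fixed sweep counts are a convenience for the example; a practical implementation would more likely stop each loop once the multipliers or beliefs change by less than a tolerance.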
We now show that there are formal similarities between this algorithm and
belief propagation as specified by equations (2,3,4).

We can relate the messages $\{m_{ij}\}$ of BP to Lagrange multipliers $\{\lambda_{ij}\}$ by
$m_{ji}(x_i) = e^{\lambda_{ji}(x_i)} e^{-\frac{1}{n_i - 1} \sum_k \lambda_{ki}(x_i)}$ with inverse $e^{-\lambda_{ij}(x_j)} = \prod_{k \neq i} m_{kj}(x_j)$,
$\forall\, i, j, x_i, x_j$ (see [22], [24]). Using these relations, and comparing equations (4,24),
we see that the update equations for the $\{b_{ij}\}$ are identical for the double-loop
algorithm and for BP. (In fact, the $\{b_{ij}\}$ variables play no role in the dynamics
of either algorithm and need not be evaluated at all until the algorithm has
converged).

The main difference between the two algorithms is that the double-loop
algorithm has an outer loop to update the $\{b_i\}$ (equation (25)) and an inner loop
(equation (26)) to update the $\{\lambda_{ij}, \gamma_{ij}\}$. By contrast, in the BP algorithm all the
dynamics (equation (2)) can be expressed in terms of the messages $\{m_{ij}\}$ or the
Lagrange parameters $\{\lambda_{ij}\}$ (which directly determine the $\{b_i\}$ by equation (3)).

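To go further, we express the BP update rules, equations (2,3), in terms of
the Lagrange multipliers. This gives:

$$e^{\lambda_{ij}(x_j; t+1)}\, e^{-\frac{1}{n_j - 1} \sum_k \lambda_{kj}(x_j; t+1)} = c \sum_{x_i} \psi_{ij}(x_i, x_j)\, \psi_i(x_i)\, e^{-\lambda_{ji}(x_i; t)}, \tag{27}$$

$$b_i(x_i; t) = \hat{c}\, \psi_i(x_i)\, e^{-\frac{1}{n_i - 1} \sum_k \lambda_{ki}(x_i; t)}, \tag{28}$$

which we can compare to the equations for the double-loop algorithm:

$$e^{2\lambda_{pq}(x_q; \tau+1)} = \frac{e^{-\gamma_{pq}} \sum_{x_p} \phi_{pq}(x_p, x_q)\, e^{-\lambda_{qp}(x_p; \tau)}}{\psi_q(x_q)\, e^{n_q} \left\{ \frac{b_q(x_q; t)}{\psi_q(x_q)} \right\}^{n_q} e^{\sum_{j \neq p} \lambda_{jq}(x_q; \tau)}}, \tag{29}$$

$$b_i(x_i; t+1) = \psi_i(x_i)\, e^{-1} e^{n_i} \left\{ \frac{b_i(x_i; t)}{\psi_i(x_i)} \right\}^{n_i} e^{\sum_k \lambda_{ki}(x_i)}, \tag{30}$$

where (as before) we use the variables $\tau$ and $t$ for the time steps in the
inner and outer loops respectively. We also use the relationship $\phi_{pq}(x_p, x_q) =
\psi_{pq}(x_p, x_q)\psi_p(x_p)\psi_q(x_q)$. (We ignore the “normalization dynamics”, i.e. updating
the $\{\gamma_{pq}\}$, because it is straightforward and can be taken for granted. With this
understanding we have dropped the dependence of the $\{\gamma_{pq}\}$ on $\tau$).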

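Now suppose we try to approximate the double-loop algorithm by setting
$b_i(x_i; t) = \hat{c}\, \psi_i(x_i)\, e^{-\frac{1}{n_i - 1} \sum_k \lambda_{ki}(x_i; \tau)}$, see equation (28). This “collapses” the
outer loop of the double-loop algorithm and so only the inner loop remains.
This reduces equation (29) to:

$$e^{2\lambda_{pq}(x_q; \tau+1)} \times \left\{ e^{-\lambda_{pq}(x_q; \tau)}\, e^{-\frac{1}{n_q - 1} \sum_j \lambda_{jq}(x_q; \tau)} \right\} = \sum_{x_p} \psi_{pq}(x_p, x_q)\, \psi_p(x_p)\, e^{-\lambda_{qp}(x_p; \tau)}, \tag{31}$$

which, up to factors independent of $x_q$, is almost exactly the same as the BP
algorithm, equation (27). (Use the identity
$\left\{ e^{-\frac{1}{n_q - 1} \sum_j \lambda_{jq}(x_q)} \right\}^{n_q} e^{\sum_{j \neq p} \lambda_{jq}(x_q)} = e^{-\frac{1}{n_q - 1} \sum_j \lambda_{jq}(x_q)}\, e^{-\lambda_{pq}(x_q)}$). The
only difference is that the factor $\left\{ e^{-\lambda_{pq}(x_q; \tau)}\, e^{-\frac{1}{n_q - 1} \sum_j \lambda_{jq}(x_q; \tau)} \right\}$ is evaluated at time
$\tau$ for equation (31) and at time $t + 1$ for equation (27).

It is also straightforward to check that equation (30) for updating the outer
loop has $b_i(x_i) = \hat{c}\, \psi_i(x_i)\, e^{-\frac{1}{n_i - 1} \sum_k \lambda_{ki}(x_i)}$ as a fixed point. More precisely, if we
converge to a state where $\lambda_{ij}(x_j; \tau + 1) = \lambda_{ij}(x_j; \tau)$ $\forall\, i, j, x_j$ then
substituting $b_i(x_i; t) = \hat{c}\, \psi_i(x_i)\, e^{-\frac{1}{n_i - 1} \sum_k \lambda_{ki}(x_i; \tau)}$ into equation (30) gives $b_i(x_i; t + 1) =
b_i(x_i; t)$ $\forall\, i, x_i$.

To conclude, the BP algorithm can be thought of as an approximation to the
double-loop algorithm. The approximation is done by assuming a fixed functional
form for the $\{b_i\}$ in terms of the $\{\lambda_{ij}\}$, thereby collapsing the outer loop of the
algorithm, and modifying the update equations for the inner loop (by evaluating
certain terms at $\tau + 1$ instead of at $\tau$).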



6 Discussion 

This section covers three additional topics which are essentially review mate- 
rial (i.e. no novel results are presented here). Firstly, we discuss the relationship 
of the Bethe approximation to MAP/MV estimation and how a temperature 
parameter can be introduced to obtain deterministic annealing. Secondly, we 
describe how the Bethe and mean field free energies can be obtained from an
information geometry viewpoint [1] using Kullback-Leibler divergences. Relations
of the double-loop algorithm to mean field theory are described in [25].



6.1 MV and MAP Estimation and Deterministic Annealing 

We now briefly discuss the use of the Bethe approximations for estimation and 
how they can be modified to allow for temperature annealing [6]. For simplicity 
we concentrate only on the Bethe Free Energy. 

The input to a double-loop algorithm is a probability distribution function
$P(x_1, ..., x_N | y)$, see equation (1). The output of the algorithm is an estimate of
the marginal and joint probability distributions.

In many optimization problems it is desired to estimate the variables $\{x_i\}$
from $P(x_1, ..., x_N | y)$. Two common estimators are the Minimal Variance
(MV) estimator and the Maximum a Posteriori (MAP) estimator. From the estimates
of the marginal probability distributions $\{b_i(x_i)\}$, provided by minimizing the
Bethe free energy, it is possible to directly obtain an approximation to the MV
estimate by computing $\bar{x}_i = \sum_{x_i} x_i b_i(x_i)$.

The MAP estimate can be obtained by introducing a temperature parameter
$T$. We replace the probability distribution $P(x_1, ..., x_N | y)$ by $\{P(x_1, ..., x_N | y)\}^{1/T}$
which implies, by equation (1), replacing $\{\psi_{ij}\}, \{\psi_i\}$ by $\{\psi_{ij}^{1/T}\}, \{\psi_i^{1/T}\}$,
and then scaling the free energy by a factor $T$. This gives a family of Bethe free energies:

$$F_\beta^T(\{b_{ij}, b_i\}) = T \sum_{i,j: i>j} \sum_{x_i, x_j} b_{ij}(x_i, x_j) \log b_{ij}(x_i, x_j) - \sum_{i,j: i>j} \sum_{x_i, x_j} b_{ij}(x_i, x_j) \log \phi_{ij}(x_i, x_j)$$
$$\qquad - T \sum_i (n_i - 1) \sum_{x_i} b_i(x_i) \log b_i(x_i) + \sum_i (n_i - 1) \sum_{x_i} b_i(x_i) \log \psi_i(x_i), \tag{32}$$

where the original Bethe free energy, see equation (5), is obtained by setting
$T = 1$. We see that equation (32) is equivalent to having energy terms
$-\sum_{i,j: i>j} \sum_{x_i, x_j} b_{ij}(x_i, x_j) \log \phi_{ij}(x_i, x_j) + \sum_i (n_i - 1) \sum_{x_i} b_i(x_i) \log \psi_i(x_i)$
(independent of temperature $T$) and temperature dependent entropy terms given
by $T \sum_{i,j: i>j} \sum_{x_i, x_j} b_{ij}(x_i, x_j) \log b_{ij}(x_i, x_j) - T \sum_i (n_i - 1) \sum_{x_i} b_i(x_i) \log b_i(x_i)$.

This is the same form as for mean field annealing, see for example [7], [13].

Minimizing $F_\beta^T$ gives a family of marginal distributions $\{b_i^T(x_i)\}$. As $T \to 0$
the marginals $\{b_i^T(x_i)\}$ will become sharply peaked about the most probable
values $x_i^{*T} = \arg\max_{x_i} b_i^T(x_i)$. Therefore the estimates $\{x_i^{*T}\}$ will become
approximations to the MAP estimates $\{x_i^*\} = \arg\max_{\{x_i\}} P(x_1, ..., x_N | y)$.
(Provided some technical conditions apply, see [23] for a theoretical analysis of what
conditions are needed for the mean field theory approximation).

In addition, for many optimization problems it may be desirable to have
some form of temperature annealing to help avoid local minima. This can be
achieved by calculating the estimates of the marginals $\{b_i^T(x_i)\}$, and the joint
distribution $\{b_{ij}^T(x_i, x_j)\}$, at high values of the temperature $T$ and using these
as initializations for the estimates for smaller values of $T$. See [23], [13] for
examples of this approach and for further references.
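
The effect of the temperature can be seen on a toy example. The sketch below uses exact marginals of a small joint table instead of Bethe estimates, which is an assumption made to keep the example self-contained; the function names `tempered_marginals` and `mv_estimate` and the example table are likewise hypothetical. It computes the marginals of the distribution proportional to $P^{1/T}$ and the MV estimate $\bar{x}_i = \sum_{x_i} x_i b_i(x_i)$.

```python
import numpy as np

def tempered_marginals(P, T):
    """Marginals of the distribution proportional to P**(1/T), for a small joint table P."""
    Q = P ** (1.0 / T)
    Q = Q / Q.sum()
    # One marginal per axis (variable) of the joint table.
    return [Q.sum(axis=tuple(a for a in range(Q.ndim) if a != i)) for i in range(Q.ndim)]

def mv_estimate(b, values=None):
    """Approximate minimal-variance estimate sum_x x * b(x) for one node."""
    values = np.arange(len(b)) if values is None else np.asarray(values)
    return float(np.dot(values, b))

# Joint table over two binary variables; its most probable state is (0, 0).
P = np.array([[0.40, 0.05],
              [0.30, 0.25]])
warm = tempered_marginals(P, T=1.0)   # marginals of P itself
cold = tempered_marginals(P, T=0.1)   # sharply peaked marginals
```

For this table the peak of the first variable's marginal at $T = 1$ sits at state 1, which disagrees with the most probable joint state, whereas at $T = 0.1$ both marginals peak at state 0.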

6.2 Kullback-Leibler Approximations and Information Geometry 

Amari’s theory of information geometry [1] gives a framework for obtaining 
approximations to probability distributions (where the approximations are re- 
stricted to have specific functional forms) . This gives some insight into the Bethe 
approximations. The material on the mean field theory approximation is related 
to Saul et al [16] and the Bethe free energy is similar to the derivation by Yedidia 
et al [22]. 

Let the target distribution be
$P(x_1,\ldots,x_N) = \frac{1}{Z}\prod_{i,j:\,i>j}\psi_{ij}(x_i,x_j)\prod_i \psi_i(x_i,y_i)$
(see equation (1)). Suppose we want to approximate it by a factorizable
distribution $P_F(x_1,\ldots,x_N) = \prod_i b_i(x_i)$ (with $\sum_{x_i} b_i(x_i) = 1\ \forall i$). One
measure of similarity, suggested by information geometry, is to seek the distribution
$P_F(x_1,\ldots,x_N)$ which minimizes the Kullback-Leibler divergence to $P(x_1,\ldots,x_N)$:

$$D(P_F\|P) = \sum_{x_1,\ldots,x_N} P_F(x_1,\ldots,x_N)\log\frac{P_F(x_1,\ldots,x_N)}{P(x_1,\ldots,x_N)}. \tag{33}$$

We can re-express this as:

$$D(P_F\|P) = -\sum_i\sum_{x_i} b_i(x_i)\log\psi_i(x_i) - \sum_{i,j:\,i>j}\sum_{x_i,x_j} b_i(x_i)\,b_j(x_j)\log\psi_{ij}(x_i,x_j) + \sum_i\sum_{x_i} b_i(x_i)\log b_i(x_i) + \log Z. \tag{34}$$

Minimizing $D(P_F\|P)$ with respect to the $\{b_i\}$ (with the constraints that
$\sum_{x_i} b_i(x_i) = 1$) gives the mean field free energy used for many optimization
applications, see for example [23], [13]. (The notation is different in those papers.
For example, to obtain the formulation used in [23] we relate the variables $x_i, x_j$
to $a, b$ and identify the variables $\{b_i(x_i)\}$ with the matching variables $\{S_{ia}\}$
and the terms $-\log\psi_i(x_i)$ and $-\log\psi_{ij}(x_i,x_j)$ with terms $-A_{ia}$ and $T_{ijab}$. The
variables are constrained so that $\sum_a S_{ia} = 1\ \forall i$.)
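
The decomposition of the Kullback-Leibler divergence into single-node, pairwise, entropy and $\log Z$ terms can be checked numerically on a toy model. The following is a minimal sketch, assuming a hypothetical three-node binary model with randomly drawn potentials and randomly drawn factorized beliefs; none of these values come from the paper, and the sum over pairs is taken over $i > j$.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
N, K = 3, 2                                   # three nodes, two states each (toy sizes)
edges = [(1, 0), (2, 1), (2, 0)]              # pairs (i, j) with i > j; loops are fine here
psi_i = rng.uniform(0.5, 2.0, size=(N, K))    # hypothetical unary potentials psi_i(x_i)
psi_ij = {e: rng.uniform(0.5, 2.0, size=(K, K)) for e in edges}  # psi_ij[x_i, x_j]
b = rng.dirichlet(np.ones(K), size=N)         # factorized beliefs b_i(x_i), rows sum to 1

states = list(itertools.product(range(K), repeat=N))

def unnormalized(x):
    """Product of all unary and pairwise potentials for configuration x."""
    p = np.prod([psi_i[i, x[i]] for i in range(N)])
    for (i, j) in edges:
        p *= psi_ij[(i, j)][x[i], x[j]]
    return p

def p_factor(x):
    """Factorized approximation P_F(x) = prod_i b_i(x_i)."""
    return np.prod([b[i, x[i]] for i in range(N)])

Z = sum(unnormalized(x) for x in states)

# Direct evaluation of the KL divergence by enumerating all configurations.
kl_direct = sum(p_factor(x) * np.log(p_factor(x) * Z / unnormalized(x)) for x in states)

# Term-by-term evaluation: unary energy, pairwise energy, negative entropy, log Z.
kl_formula = (
    -sum(b[i] @ np.log(psi_i[i]) for i in range(N))
    - sum(b[i] @ np.log(psi_ij[(i, j)]) @ b[j] for (i, j) in edges)
    + sum(b[i] @ np.log(b[i]) for i in range(N))
    + np.log(Z)
)
print(kl_direct, kl_formula)
```

The two values agree because the expectation of each potential under a factorized distribution only involves the beliefs of the nodes that the potential touches.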

The Bethe free energy can be obtained in a similar way (although, as we 
will discuss, an additional approximation is needed if the graph has closed 
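loops). We seek to approximate the distribution $P(x_1,\ldots,x_N)$ by a distribution
$P_\beta(x_1,\ldots,x_N)$ which has marginals $\{b_i(x_i)\}$ and joint distributions $\{b_{ij}(x_i,x_j)\}$.
Once again, we attempt to find the approximation $P_\beta(x_1,\ldots,x_N)$ which minimizes
the Kullback-Leibler divergence:

$$\begin{aligned} D(P_\beta\|P) &= \sum_{x_1,\ldots,x_N} P_\beta(x_1,\ldots,x_N)\log\frac{P_\beta(x_1,\ldots,x_N)}{P(x_1,\ldots,x_N)} \\ &= \sum_{x_1,\ldots,x_N} P_\beta(x_1,\ldots,x_N)\log P_\beta(x_1,\ldots,x_N) - \sum_{x_1,\ldots,x_N} P_\beta(x_1,\ldots,x_N)\log P(x_1,\ldots,x_N). \end{aligned} \tag{35}$$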



The second term of the Kullback-Leibler divergence, see equation (35), can be expressed
as $-\sum_i\sum_{x_i} b_i(x_i)\log\psi_i(x_i) - \sum_{i,j:\,i>j}\sum_{x_i,x_j} b_{ij}(x_i,x_j)\log\psi_{ij}(x_i,x_j) + \log Z$.

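The first term is the negative entropy of the approximating distribution $P_\beta(x_1,\ldots,x_N)$.
The Bethe approximation consists of assuming that we can express this as:

$$\sum_{x_1,\ldots,x_N} P_\beta(x_1,\ldots,x_N)\log P_\beta(x_1,\ldots,x_N) \approx \sum_{i,j:\,i>j}\sum_{x_i,x_j} b_{ij}(x_i,x_j)\log b_{ij}(x_i,x_j) - \sum_i (n_i - 1)\sum_{x_i} b_i(x_i)\log b_i(x_i). \tag{36}$$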



It can be shown that this approximation is exact if the graph has no loops.
(Recall that we only define joint distributions $b_{ij}(x_i,x_j)$ for nodes $i,j$ which
are directly joined in the graph. Also $n_i$ is the number of neighbours of node $i$.) If the
graph has no loops then the marginal and joint distributions (combined with
the maximum entropy principle) determine a unique probability distribution
(see, for example, [15]):

$$P_\beta(x_1,\ldots,x_N) = \frac{\prod_{i,j:\,i>j} b_{ij}(x_i,x_j)}{\prod_i \{b_i(x_i)\}^{n_i-1}}. \tag{37}$$

In this case, the entropy can be computed directly to give equation (36).
But if the graph has loops then $P_\beta(x_1,\ldots,x_N)$ given by equation (37) is not a
normalized distribution. The Bethe approximation assumes that the entropy can
be written as equation (36) even when the graph has closed loops.

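Using equation (36) for the entropy gives the result:

$$D(P_\beta\|P) \approx -\sum_i\sum_{x_i} b_i(x_i)\log\psi_i(x_i) - \sum_{i,j:\,i>j}\sum_{x_i,x_j} b_{ij}(x_i,x_j)\log\psi_{ij}(x_i,x_j) + \sum_{i,j:\,i>j}\sum_{x_i,x_j} b_{ij}(x_i,x_j)\log b_{ij}(x_i,x_j) - \sum_i (n_i - 1)\sum_{x_i} b_i(x_i)\log b_i(x_i), \tag{38}$$

which, up to the additive constant $\log Z$ (omitted here because it does not depend
on the beliefs), is exact for graphs with no loops and an approximation otherwise.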

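It can be quickly verified that the right hand side of equation (38) is the Bethe
free energy (which involves using the relations $\phi_{ij}(x_i,x_j) = \psi_{ij}(x_i,x_j)\,\psi_i(x_i)\,\psi_j(x_j)$
and the identity $\sum_{i,j:\,i>j}\sum_{x_i,x_j} b_{ij}(x_i,x_j)\log\{\psi_i(x_i)\psi_j(x_j)\} = \sum_i n_i\sum_{x_i} b_i(x_i)\log\psi_i(x_i)$,
which holds when the joint beliefs marginalize to the single-node beliefs).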
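
The exactness of the Bethe form on a loop-free graph can be illustrated numerically. The sketch below assumes a hypothetical three-node chain 0 - 1 - 2 with randomly drawn potentials (not taken from the paper); it computes the true marginals by enumeration, evaluates the Bethe free energy at those marginals, and rebuilds the joint distribution from pairwise and single-node marginals as in equation (37).

```python
import numpy as np

rng = np.random.default_rng(0)
K = 3                                         # states per node (toy size)
psi = rng.uniform(0.5, 2.0, size=(3, K))      # hypothetical unary potentials psi_i(x_i)
psi10 = rng.uniform(0.5, 2.0, size=(K, K))    # psi_10[x_1, x_0]
psi21 = rng.uniform(0.5, 2.0, size=(K, K))    # psi_21[x_2, x_1]
n = [1, 2, 1]                                 # number of neighbours of each node

# Joint distribution P[x0, x1, x2] of the chain, normalized by Z.
P = np.einsum('a,b,c,ba,cb->abc', psi[0], psi[1], psi[2], psi10, psi21)
Z = P.sum()
P /= Z

# Exact single-node and pairwise marginals.
b = [P.sum(axis=(1, 2)), P.sum(axis=(0, 2)), P.sum(axis=(0, 1))]
b10 = P.sum(axis=2).T                         # indexed [x_1, x_0]
b21 = P.sum(axis=0).T                         # indexed [x_2, x_1]

def pair_term(bij, psi_ij, psi_i, psi_j):
    """sum b_ij log(b_ij / phi_ij) with phi_ij = psi_ij * psi_i * psi_j."""
    phi = psi_ij * psi_i[:, None] * psi_j[None, :]
    return np.sum(bij * np.log(bij / phi))

bethe = pair_term(b10, psi10, psi[1], psi[0]) + pair_term(b21, psi21, psi[2], psi[1])
bethe -= sum((n[i] - 1) * np.sum(b[i] * np.log(b[i] / psi[i])) for i in range(3))

# Tree factorization: product of pairwise marginals over b_i^(n_i - 1).
P_tree = np.einsum('ba,cb,b->abc', b10, b21, 1.0 / b[1])
print(bethe, -np.log(Z))
```

On this chain the Bethe free energy evaluated at the true marginals equals $-\log Z$, and the tree factorization reproduces the joint distribution; with a closed loop neither statement would hold in general.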

In summary, the mean field and the Bethe free energies can be obtained by
replacing the true distribution $P(x_1,\ldots,x_N)$ by approximating probability
distributions with specified marginal distributions (mean field) or specified
marginal and joint distributions (Bethe). The Bethe free energy requires an
additional approximation, about the entropy, if the graph has closed loops. The
Kikuchi approximation can be obtained in a similar way. Essentially all these
approaches (mean field, Bethe, Kikuchi) seek to approximate the true distribution
$P(x_1,\ldots,x_N)$ by distributions which are easier to compute with.

7 Conclusion 

The aims of this paper were to analyze the BP algorithm of Yedidia et al [22] 
and to design a double-loop algorithm which can be proven to converge to a 
minimum of the Bethe free energy. It may also be helpful to determine when 
BP, and related algorithms, converge to the correct result; see [20].

More constructively, we derived a double-loop algorithm based on separating 
the Bethe free energy into concave and convex parts and then using a discrete it- 
erative algorithm to solve for the constraints. We showed that the BP algorithm 
is formally similar to the double-loop algorithm (and might be an effective ap- 
proximation in some cases). 

In other work, see [24], we generalized our results to the Kikuchi free
energy. This yielded both a dual formulation of the free energy and a double-loop
algorithm that can be proven to converge to a minimum of the Kikuchi free 
energy. 

Finally, we described how previous mean field theory work could be
interpreted within this framework. It seems that the difference between mean field
theory and the Bethe methods is merely the degree of the approximation used
(i.e. do we try to approximate the target distribution by a factorizable
distribution, the mean field approach, or do we look for a higher order Bethe
approximation). Mean field theory algorithms can be very effective on difficult
optimization problems (e.g., see [13]) provided the factorization assumption
is good enough, but they also fail on other difficult problems such as the
non-Euclidean Traveling Salesman Problem (Rangarajan, personal communication).

Current work also aims to implement the double-loop algorithm on computer
vision problems and compare its performance to alternatives such as mean field 
theory algorithms. It has already been demonstrated [21] that the BP algorithms 
give good results in situations where mean field algorithms break down. 
Abstract. Using first principles, we establish in this paper a connection
between the maximum a posteriori (MAP) estimator and the variational 
formulation of optimizing a given functional subject to some noise con- 
straints. A MAP estimator which uses a Markov or a maximum entropy 
random field model for a prior distribution can be viewed as a minimizer 
of a variational problem. Using notions from robust statistics, a varia- 
tional filter called Huber gradient descent flow is proposed. It yields the 
solution to a Huber type functional subject to some noise constraints, 
and the resulting filter behaves like a total variation anisotropic diffu- 
sion for large gradient magnitudes and like an isotropic diffusion for small 
gradient magnitudes. Using some of the gained insight, we are also able 
to propose an information-theoretic gradient descent flow whose func- 
tional turns out to be a compromise between a neg-entropy variational 
integral and a total variation. Illustrating examples demonstrate a much 
improved performance of the proposed filters in the presence of Gaussian 
and heavy tailed noise. 



1 Introduction 

Linear filtering techniques abound in many image processing applications and 
their popularity mainly stems from their mathematical simplicity and their effi- 
ciency in the presence of additive Gaussian noise. A mean filter for example is 
the optimal filter for Gaussian noise in the sense of mean square error. Linear 
filters, however, tend to blur sharp edges, destroy lines and other fine image 
details, fail to effectively remove heavy tailed noise, and perform poorly in the 
presence of signal-dependent noise. This led to a search for nonlinear filtering 
alternatives. The research effort on nonlinear median-based filtering has resulted 
in remarkable results, and has highlighted some new promising research avenues 
[1]. On account of its simplicity, its edge preservation property and its robust- 
ness to impulsive noise, the standard median filter remains among the favorites 
for image processing applications [1]. The median filter, however, often tends to
remove fine details in the image, such as thin lines and corners [1]. In recent
years, a variety of median-type filters such as stack filters, weighted median [1],
and relaxed median [2] have been developed to overcome this drawback. In spite
of an improved performance, these solutions were missing the regularizing power
of a prior on the underlying information of interest.

* This work was supported by an AFOSR grant F49620-98-1-0190 and by ONR-MURI
grant JHU-72798-S2 and by NCSU School of Engineering.

Among Bayesian image estimation methods, the MAP estimator using 
Markov or maximum entropy random field priors [3,4,5] has proven to be a 
powerful approach to image restoration. Among the limitations in using MAP
estimation is the difficulty of systematically, easily and reliably choosing a prior
distribution and its corresponding optimizing energy function, and in some cases
the resulting computational complexity.

In recent years, variational methods and partial differential equation (PDE)
based methods [6,7] have been introduced to explicitly account for intrinsic
geometry in a variety of problems including image segmentation, mathematical
morphology and image denoising. The latter will be the focus of the present
paper. The problem of denoising has been addressed using a number of different
techniques including wavelets [9], order statistics based filters [1], PDE-based
algorithms [6,10,11], and variational approaches [7]. A large number of PDE
based methods have particularly been proposed to tackle the problem of image 
denoising with a good preservation of the edges. Much of the appeal of PDE- 
based methods lies in the availability of a vast arsenal of mathematical tools 
which at the very least act as a key guide in achieving numerical accuracy as 
well as stability. Partial differential equations or gradient descent flows are gen- 
erally a result of variational problems using the Euler-Lagrange principle [12]. 
One popular variational technique used in image denoising is the total variation 
based approach. It was developed in [10] to overcome the basic limitations of 
all smooth regularization algorithms, and a variety of numerical methods have 
also been recently developed for solving total variation minimization problems 
[10,13,14]. 

In this paper, we present a variational approach to MAP estimation. The 
key idea behind this approach is to use geometric insight in helping construct 
regularizing functionals and avoiding a subjective choice of a prior in MAP 
estimation. Using tools from robust statistics and information theory, we propose 
two gradient descent flows for image denoising to illustrate the resulting overall 
methodology. 

In the next section we briefly recall the MAP estimation formulation. In
Section 3, we formulate a variational approach to MAP estimation. Section 4 is
devoted to a robust variational formulation. In Section 5, an entropic variational
approach to MAP estimation is given, and an improved entropic gradient descent
flow is proposed in Section 5.2. Finally, in Section 6, we provide experimental
results to show a much improved performance of the proposed gradient descent
flows in image denoising.
2 Problem Formulation

Consider an additive noise model for an observed image

$$u_0 = u + \eta, \tag{1}$$

where $u$ is the original image $u : \Omega \to \mathbb{R}$, and $\Omega$ is a nonempty, bounded, open
set in $\mathbb{R}^2$ (usually $\Omega$ is a rectangle in $\mathbb{R}^2$). The noise process $\eta$ is i.i.d., and $u_0$
is the observed image. The objective is to recover $u$, knowing $u_0$ and also some
statistics of $\eta$. Throughout, $x = (x_1, x_2)$ denotes a pixel location in $\Omega$, $|\cdot|$ denotes
the Euclidean norm and $\|\cdot\|$ denotes the $L^2$-norm.

Many image denoising methods have been proposed to estimate u, and among 
these figure Bayesian estimation schemes [3], wavelet based methods [9], and 
PDE based techniques [6,10,11]. 

One commonly used Bayesian approach in image denoising is the maximum a
posteriori (MAP) estimation method which incorporates prior information.
Denote by $p(u)$ the prior distribution for the unknown image $u$. The MAP estimator
is given by

$$\hat{u} = \arg\max_u \{\log p(u_0|u) + \log p(u)\}, \tag{2}$$

where $p(u_0|u)$ denotes the conditional probability of $u_0$ given $u$.

A general model for the prior distribution $p(u)$ is a Markov random field
(MRF) which is characterized by its Gibbs distribution given by [3]

$$p(u) = \frac{1}{Z}\exp\left\{-\frac{\mathcal{F}(u)}{\lambda}\right\},$$

where $Z$ is the partition function, and $\lambda$ is a constant known as the temperature in
the terminology of physical systems. For large $\lambda$, the prior probability becomes
flat, and for small $\lambda$, the prior probability has sharp modes. $\mathcal{F}$ is called the
energy function and has the form $\mathcal{F}(u) = \sum_{c\in\mathcal{C}} V_c(u)$, where $\mathcal{C}$ denotes the
set of cliques for the MRF, and $V_c$ is a potential function defined on a clique.
Markov random fields have been extensively used in computer vision, particularly
for image restoration, and it has been established that Gibbs distributions and
MRFs are equivalent [3]; in other words, if a problem can be defined in terms of
local potentials then there is a simple way of formulating the problem in terms
of MRFs.

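If the noise process $\eta$ is i.i.d. Gaussian, then we have

$$p(u_0|u) = K\exp\left(-\frac{\|u - u_0\|^2}{2\sigma^2}\right),$$

where $K$ is a normalizing positive constant, and $\sigma^2$ is the noise variance. Thus,
the MAP estimator in (2) yields

$$\hat{u} = \arg\min_u \left\{\mathcal{F}(u) + \frac{\lambda}{2}\|u - u_0\|^2\right\}, \tag{3}$$

with the factor $1/\sigma^2$ absorbed into the constant $\lambda$.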
Image estimation using MRF priors has proven to be a powerful approach to
restoration and reconstruction of high-quality images. A major problem limiting
its utility is, however, the lack of a practical and robust method for systematically
selecting the prior distribution. The Gibbs prior parameter $\lambda$ is also of particular
importance since it controls the balance of influence of the Gibbs prior and that
of the likelihood. If $\lambda$ is too small, the prior will tend to have an over-smoothing
effect on the solution. Conversely, if it is too large, the MAP estimator may be
unstable, reducing to the maximum likelihood solution as $\lambda$ goes to infinity.

Another difficulty in using a MAP estimator is the non-uniqueness of the
solution when the energy function $\mathcal{F}$ is not convex.
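
The role of $\lambda$ as a balance between prior and likelihood can be seen on a one-dimensional toy problem. The sketch below is only an illustration under stated assumptions: it swaps in a quadratic (Dirichlet-type) energy $\mathcal{F}(u) = \frac{1}{2}\|Du\|^2$ with a forward-difference matrix $D$, for which the minimizer of $\mathcal{F}(u) + \frac{\lambda}{2}\|u - u_0\|^2$ solves the linear system $(D^T D + \lambda I)u = \lambda u_0$. The signal, the noise level and the function name `map_quadratic_1d` are hypothetical.

```python
import numpy as np

def map_quadratic_1d(u0, lam):
    """Minimize 0.5*||D u||^2 + 0.5*lam*||u - u0||^2 for a 1-D signal u0."""
    n = u0.size
    D = np.diff(np.eye(n), axis=0)            # (n-1) x n forward-difference matrix
    A = D.T @ D + lam * np.eye(n)             # normal equations of the quadratic problem
    return np.linalg.solve(A, lam * u0)

rng = np.random.default_rng(0)
clean = np.concatenate([np.zeros(20), np.ones(20)])      # a step edge
u0 = clean + 0.1 * rng.standard_normal(clean.size)       # noisy observation

u_small = map_quadratic_1d(u0, lam=1e-4)      # prior dominates: nearly constant output
u_large = map_quadratic_1d(u0, lam=1e4)       # data term dominates: output is close to u0
print(u_small.max() - u_small.min(), np.abs(u_large - u0).max())
```

With a very small $\lambda$ the estimate is over-smoothed towards the mean of the data, while a very large $\lambda$ returns essentially the noisy observation, i.e. the maximum likelihood solution.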



3 A Variational Approach to MAP Estimation 



According to noise model (1), our goal is to estimate the original image $u$ based
on the observed image $u_0$ and on any knowledge of the noise statistics of $\eta$. This
leads to solving the following noise-constrained optimization problem

$$\min_u\ \mathcal{F}(u) \quad \text{s.t.}\quad \|u - u_0\|^2 = \sigma^2, \tag{4}$$

where $\mathcal{F}$ is a given functional which is often a criterion of smoothness of the
reconstructed image.

Using Lagrange’s theorem, the minimizer of (4) is given by

$$\hat{u} = \arg\min_u\left\{\mathcal{F}(u) + \frac{\lambda}{2}\|u - u_0\|^2\right\}, \tag{5}$$

where $\lambda$ is a nonnegative parameter chosen so that the constraint $\|u - u_0\|^2 = \sigma^2$
is satisfied. In practice, the parameter $\lambda$ is often estimated or chosen a priori.

Equations (3) and (5) show a close connection between image recovery via 
MAP estimation and image recovery via optimization of variational integrals. 
Indeed, Eq. (3) may be written in integral form as Eq. (5). 

A critical issue is the choice of the variational integral $\mathcal{F}$. The classical
functionals (also called variational integrals) used in image denoising are the Dirichlet
and the total variation integrals defined respectively as follows

$$\mathcal{D}(u) = \frac{1}{2}\int_\Omega |\nabla u|^2\,dx, \quad\text{and}\quad TV(u) = \int_\Omega |\nabla u|\,dx, \tag{6}$$

where $\nabla u$ stands for the gradient of the image $u$.

A generalization of these functionals is the variational integral given by

$$\mathcal{F}(u) = \int_\Omega F(|\nabla u|)\,dx, \tag{7}$$

where $F : \mathbb{R}^+ \to \mathbb{R}$ is a given smooth function called variational integrand or
Lagrangian [12].
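
A small discrete example may help to see how the two classical integrals treat edges differently. This is a minimal sketch on one-dimensional signals with unit grid spacing; the function names and the two test signals are illustrative assumptions, not part of the paper.

```python
import numpy as np

def dirichlet_1d(u):
    """Discrete Dirichlet integral: one half of the sum of squared differences."""
    return 0.5 * np.sum(np.diff(u) ** 2)

def tv_1d(u):
    """Discrete total variation: sum of absolute differences."""
    return np.sum(np.abs(np.diff(u)))

step = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])   # one sharp jump of height 1
ramp = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])   # the same rise spread over five steps

print(tv_1d(step), tv_1d(ramp))                   # equal total variation
print(dirichlet_1d(step), dirichlet_1d(ramp))     # the sharp edge costs more
```

The total variation is the same for the sharp edge and the ramp, whereas the Dirichlet integral charges the sharp edge five times more, which is one way to read the edge-blurring tendency of quadratic regularization.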
The total variation method [10] basically consists in finding an estimate $\hat{u}$
for the original image $u$ with the smallest total variation among all the images
satisfying the noise constraint $\|u - u_0\|^2 = \sigma^2$, where $\sigma$ is assumed known. Note
that the regularizing parameter $\lambda$ controls the balance between minimizing the
term which corresponds to fitting the data, and minimizing the regularizing term.
The intuition behind the use of the total variation integral is that it incorporates
the fact that discontinuities are present in the original image $u$ (it measures the
jumps of $u$, even if it is discontinuous). The total variation method has been used
with success in image denoising, especially for denoising images with piecewise
constant features while preserving the location of the edges exactly [15,16].

Using Eq. (7), we define the following functional

$$\mathcal{L}(u) = \mathcal{F}(u) + \frac{\lambda}{2}\|u - u_0\|^2 = \int_\Omega\left(F(|\nabla u|) + \frac{\lambda}{2}|u - u_0|^2\right)dx. \tag{8}$$

Thus, the optimization problem (5) becomes

$$\hat{u} = \arg\min_{u\in X}\mathcal{L}(u) = \arg\min_{u\in X}\left\{\mathcal{F}(u) + \frac{\lambda}{2}\|u - u_0\|^2\right\}, \tag{9}$$

where $X$ is an appropriate image space of smooth functions like $C^1(\Omega)$, or the
space $BV(\Omega)$ of image functions with bounded variation¹, or the Sobolev space²
$H^1(\Omega) = W^{1,2}(\Omega)$.



3.1 Properties of the Optimization Problem 

A problem is said to be well-posed in the sense of Hadamard if (i) a solution of 
the problem exists, (ii) the solution is unique, and (iii) the solution is stable, i.e.
depends continuously on the problem data. It is ill-posed when it fails to satisfy 
at least one of these criteria. To guarantee the well-posedness of our minimization 
problem (9), the following result provides some conditions. 

Theorem 1. Let the image space $X$ be a reflexive Banach space, and let $\mathcal{F}$ be

(i) weakly lower semicontinuous, i.e. for any sequence $(u^k)$ in $X$ converging
weakly to $u$, we have $\mathcal{F}(u) \le \liminf_{k\to\infty}\mathcal{F}(u^k)$;

(ii) coercive, i.e. $\mathcal{F}(u)\to\infty$ as $\|u\|\to\infty$.

Then the functional $\mathcal{L}$ is bounded from below and possesses a minimizer, i.e.
there exists $\hat{u}\in X$ such that $\mathcal{L}(\hat{u}) = \inf_X \mathcal{L}$. Moreover, if $\mathcal{F}$ is convex and $\lambda > 0$,
then the optimization problem (9) has a unique solution, and it is stable.

Proof. From (i) and (ii) and the weak lower semicontinuity of the $L^2$-norm, the
functional $\mathcal{L}$ is weakly lower semicontinuous and coercive.
Let $(u^n)$ be a minimizing sequence of $\mathcal{L}$, i.e. $\mathcal{L}(u^n)\to\inf_X\mathcal{L}$. An immediate
consequence of the coercivity of $\mathcal{L}$ is that $(u^n)$ must be bounded. As $X$ is reflexive,
$(u^n)$ converges weakly (up to a subsequence) to some $\hat{u}$ in $X$, i.e. $u^n \rightharpoonup \hat{u}$. Thus
$\mathcal{L}(\hat{u})\le\liminf_{n\to\infty}\mathcal{L}(u^n) = \inf_X\mathcal{L}$. This proves that $\mathcal{L}(\hat{u}) = \inf_X\mathcal{L}$.

It is easy to check that convexity implies weak lower semicontinuity. Thus the
solution of the optimization problem (9) exists and it is unique because the
$L^2$-norm is strictly convex. The stability follows using the semicontinuity of $\mathcal{L}$ and
the fact that $(u^n)$ is bounded. □

¹ $BV(\Omega) = \{u \in L^1(\Omega) : TV(u) < \infty\}$ is a Banach space with the norm
$\|u\|_{BV} = \|u\|_{L^1(\Omega)} + TV(u)$.

² $H^1(\Omega) = \{u \in L^2(\Omega) : \nabla u \in L^2(\Omega)\}$ is a Hilbert space with the norm
$\|u\|^2_{H^1(\Omega)} = \|u\|^2 + \|\nabla u\|^2$.



3.2 Gradient Descent Flows 

To solve the optimization problem (9), a variety of iterative methods may be 
applied such as gradient descent [10], or fixed point method [13,16]. 

The most important first-order necessary condition to be satisfied by any
minimizer of the functional $\mathcal{L}$ given by Eq. (8) is that its first variation $\delta\mathcal{L}(u; v)$
vanishes at $u$ in the direction of $v$, that is

$$\delta\mathcal{L}(u; v) = \frac{d}{d\epsilon}\mathcal{L}(u + \epsilon v)\Big|_{\epsilon=0} = 0, \tag{10}$$

and a solution $u$ of (10) is called a weak extremal of $\mathcal{L}$ [12].

Using the fundamental lemma of the calculus of variations, relation (10)
yields the Euler-Lagrange equation as a necessary condition to be satisfied by
minimizers of $\mathcal{L}$. This Euler-Lagrange equation is given by

$$-\nabla\cdot\left(\frac{F'(|\nabla u|)}{|\nabla u|}\nabla u\right) + \lambda(u - u_0) = 0, \quad\text{in } \Omega, \tag{11}$$

with homogeneous Neumann boundary conditions. An image $u$ satisfying (11)
(or equivalently $\nabla\mathcal{L}(u) = 0$) is called an extremal of $\mathcal{L}$.

Note that $|\nabla u|$ is not differentiable when $\nabla u = 0$ (e.g. flat regions in the
image $u$). To overcome the resulting numerical difficulties, we use the following
slight modification

$$|\nabla u|_\epsilon = \sqrt{|\nabla u|^2 + \epsilon},$$

where $\epsilon$ is positive and sufficiently small.

By further constraining $\lambda$, we may be in a position to sharpen the properties
of the minimizer, as given in the following.

Proposition 1. Let $\lambda = 0$, and $S$ be a convex set of an image space $X$. If the
Lagrangian $F$ is nonnegative, convex and of class $C^1$ such that $F'(0) \ge 0$, then
the global minimizer of $\mathcal{L}$ is a constant image.

Proof. Since $F$ is convex and $F'(0) \ge 0$, it follows that $F(|\nabla u|) \ge F(0)$. Thus
the constant image is a minimizer of $\mathcal{L}$. Since $S$ is convex, it follows that this
minimizer is global. □
Using the Euler-Lagrange variational principle, the minimizer of (9) can be
interpreted as the steady state solution to the following nonlinear parabolic PDE
called gradient descent flow

$$u_t = \nabla\cdot(g(|\nabla u|)\nabla u) - \lambda(u - u_0), \quad\text{in } \Omega\times\mathbb{R}^+, \tag{12}$$

where $g(z) = F'(z)/z$, with $z > 0$, and assuming homogeneous Neumann
boundary conditions.
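
One possible way to iterate such a gradient descent flow numerically is an explicit Euler scheme. The sketch below is an assumption-laden illustration, not the discretization used in the paper: it uses forward differences with zero flux at the image border to mimic the homogeneous Neumann conditions, a backward-difference divergence, and a small constant `eps` to regularize the gradient magnitude. The names `grad`, `div`, `flow_step` and `dirichlet_energy` are hypothetical helpers.

```python
import numpy as np

def grad(u):
    """Forward differences; the last column/row is zero (zero-flux boundary)."""
    gx = np.zeros_like(u)
    gy = np.zeros_like(u)
    gx[:, :-1] = u[:, 1:] - u[:, :-1]
    gy[:-1, :] = u[1:, :] - u[:-1, :]
    return gx, gy

def div(px, py):
    """Backward-difference divergence; it is the negative adjoint of grad for
    fields whose last column (px) and last row (py) vanish, as grad produces."""
    d = np.zeros_like(px)
    d[:, 0] += px[:, 0]
    d[:, 1:] += px[:, 1:] - px[:, :-1]
    d[0, :] += py[0, :]
    d[1:, :] += py[1:, :] - py[:-1, :]
    return d

def flow_step(u, u0, g, lam, dt, eps=1e-8):
    """One explicit Euler step of u_t = div(g(|grad u|) grad u) - lam*(u - u0)."""
    gx, gy = grad(u)
    mag = np.sqrt(gx**2 + gy**2 + eps)        # regularized gradient magnitude
    w = g(mag)                                # diffusion weight g = F'(z)/z
    return u + dt * (div(w * gx, w * gy) - lam * (u - u0))

def dirichlet_energy(u):
    gx, gy = grad(u)
    return 0.5 * np.sum(gx**2 + gy**2)

rng = np.random.default_rng(0)
u0 = rng.standard_normal((32, 32))            # stand-in for a noisy image
u = u0.copy()
for _ in range(50):
    # g = 1 and lam = 0 give the heat equation; dt <= 0.25 keeps this scheme stable.
    u = flow_step(u, u0, g=lambda z: np.ones_like(z), lam=0.0, dt=0.2)
print(dirichlet_energy(u0), dirichlet_energy(u))
```

With $g \equiv 1$ and $\lambda = 0$ the iteration reduces to the heat equation: the mean grey level is preserved by the zero-flux boundary and the Dirichlet energy decreases. Other flows are obtained by passing a different weight function `g`.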

The following examples illustrate the close connection between minimizing
problems of variational integrals and boundary value problems for partial
differential equations in the case of no noise constraint (i.e. setting $\lambda = 0$):

(a) Heat equation: $u_t = \Delta u$ is the gradient descent flow for the Dirichlet
variational integral

$$\mathcal{D}(u) = \frac{1}{2}\int_\Omega |\nabla u|^2\,dx. \tag{13}$$

(b) Perona-Malik PDE: It has been shown in [17] that the anisotropic diffusion
PDE of Perona and Malik [6] given by $u_t = \nabla\cdot(g(|\nabla u|)\nabla u)$ is the gradient
descent flow for the variational integral

$$\mathcal{F}_c(u) = \int_\Omega F_c(|\nabla u|)\,dx,$$

with Lagrangians

$$F_c(z) = \frac{c^2}{2}\log\left(1 + \frac{z^2}{c^2}\right) \quad\text{or}\quad F_c(z) = \frac{c^2}{2}\left(1 - \exp\left(-\frac{z^2}{c^2}\right)\right),$$

where $z \in \mathbb{R}^+$ and $c$ is a positive tuning constant.

The key idea behind using the diffusion functions proposed by Perona and
Malik is to encourage smoothing in flat regions and to diffuse less around
the edges, so that small variations in the image such as noise are smoothed
and edges are preserved. However, it has been noted that the anisotropic
diffusion is ill-posed and it becomes well-posed under certain conditions on
the diffusion function or equivalently on the Lagrangian [17].

(c) Curvature flow: $u_t = \nabla\cdot\left(\frac{\nabla u}{|\nabla u|}\right)$ corresponds to the total variation integral

$$TV(u) = \int_\Omega |\nabla u|\,dx. \tag{14}$$
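
The relation $g(z) = F'(z)/z$ between a Lagrangian and its diffusion function can be checked numerically for the two Perona-Malik examples. The sketch below assumes the commonly quoted Perona-Malik diffusion functions $g(z) = 1/(1 + z^2/c^2)$ and $g(z) = \exp(-z^2/c^2)$, an arbitrary value of the tuning constant `c`, and a central-difference derivative; the helper names are hypothetical.

```python
import numpy as np

c = 2.0                                        # tuning constant (arbitrary choice)

def F_log(z):
    """Lagrangian (c^2/2) log(1 + z^2/c^2)."""
    return 0.5 * c**2 * np.log1p(z**2 / c**2)

def F_exp(z):
    """Lagrangian (c^2/2) (1 - exp(-z^2/c^2))."""
    return 0.5 * c**2 * (1.0 - np.exp(-z**2 / c**2))

def g_log(z):
    return 1.0 / (1.0 + z**2 / c**2)

def g_exp(z):
    return np.exp(-z**2 / c**2)

def weight_from_lagrangian(F, z, h=1e-5):
    """Approximate g(z) = F'(z)/z with a central difference for F'."""
    return (F(z + h) - F(z - h)) / (2.0 * h) / z

z = np.linspace(0.1, 10.0, 200)
err_log = np.max(np.abs(weight_from_lagrangian(F_log, z) - g_log(z)))
err_exp = np.max(np.abs(weight_from_lagrangian(F_exp, z) - g_exp(z)))
print(err_log, err_exp)
```

Both errors are at the level of the finite-difference approximation, which supports reading each diffusion function as the weight $F'(z)/z$ of the corresponding Lagrangian.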

4 Robust Variational Approach 

Robust estimation addresses the case where the distribution function is in fact
not precisely known [18,9]. In this case, a reasonable approach would be to
assume that the density is a member of some set, or some family of parametric
families, and to choose the best estimate for the least favorable member of that
set. Huber [18] proposed an $\epsilon$-contaminated normal set $\mathcal{P}_\epsilon$ defined as

$$\mathcal{P}_\epsilon = \{(1-\epsilon)\Phi + \epsilon H : H \in \mathcal{S}\},$$

where $\Phi$ is the standard normal distribution, $\mathcal{S}$ is the set of all probability
distributions symmetric with respect to the origin (i.e. such that $H(-x) = 1 - H(x)$)
and $\epsilon \in [0, 1]$ is the known fraction of “contamination”. Huber found that the
least favorable distribution in $\mathcal{P}_\epsilon$ which maximizes the asymptotic variance (or,
equivalently, minimizes the Fisher information) is Gaussian in the center and
Laplacian in the tails, and switches from one to the other at a point whose value
depends on the fraction of contamination $\epsilon$, larger fractions corresponding to
smaller switching points and vice versa.

For the set $\mathcal{P}_\epsilon$ of $\epsilon$-contaminated normal distributions, the least favorable
distribution has a density function given by [18,9]

$$f_H(z) = \frac{1-\epsilon}{\sqrt{2\pi}}\exp(-\rho_k(z)), \tag{15}$$

where $\rho_k$ is the Huber M-estimator cost function given by

$$\rho_k(z) = \begin{cases}\dfrac{z^2}{2} & \text{if } |z| \le k \\[4pt] k|z| - \dfrac{k^2}{2} & \text{otherwise}\end{cases} \tag{16}$$

and $k$ is a positive constant related to the fraction of contamination $\epsilon$ by the
equation [18]

$$2\left(\frac{\phi(k)}{k} - \Phi(-k)\right) = \frac{\epsilon}{1-\epsilon}, \tag{17}$$

where $\Phi$ is the standard normal distribution function and $\phi$ is its probability
density function. It is clear that $\rho_k$ is a convex function, quadratic in the center
and linear in the tails.
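
The switching point $k$ can be obtained from the contamination fraction $\epsilon$ by solving Huber's relation $2(\phi(k)/k - \Phi(-k)) = \epsilon/(1-\epsilon)$ numerically. The following sketch uses plain bisection, which is an implementation choice made here for illustration; the function names and the bracketing interval are assumptions.

```python
import math

def huber_contamination_ratio(k):
    """Left-hand side 2*(phi(k)/k - Phi(-k)); it decreases monotonically in k."""
    phi = math.exp(-0.5 * k * k) / math.sqrt(2.0 * math.pi)   # standard normal density
    Phi_neg = 0.5 * math.erfc(k / math.sqrt(2.0))             # Phi(-k)
    return 2.0 * (phi / k - Phi_neg)

def huber_threshold(eps, lo=1e-6, hi=10.0, iters=200):
    """Solve for the k whose ratio equals eps/(1 - eps) by bisection."""
    target = eps / (1.0 - eps)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if huber_contamination_ratio(mid) > target:
            lo = mid          # ratio too large: the root lies at a larger k
        else:
            hi = mid
    return 0.5 * (lo + hi)

k_05 = huber_threshold(0.05)
print(k_05)
```

For $\epsilon = 0.05$ the root is close to $k \approx 1.40$, and larger contamination fractions give smaller switching points, in line with the remark above.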

Motivated by the robustness of the Huber M-filter in the probabilistic
approach of image denoising [1], we define the Huber variational integral as

$$\mathcal{R}_k(u) = \int_\Omega \rho_k(|\nabla u|)\,dx. \tag{18}$$

It is worth noting that the Huber variational integral is a hybrid of the Dirichlet
variational integral ($\rho_k(|\nabla u|) \propto |\nabla u|^2/2$ as $k \to \infty$) and the total variation
integral ($\rho_k(|\nabla u|) \propto |\nabla u|$ as $k \to 0$). One may check that the Huber variational
integral $\mathcal{R}_k : H^1(\Omega) \to \mathbb{R}^+$ is well defined, convex, and coercive. It follows from
Theorem 1 that the minimization problem

$$\hat{u} = \arg\min_{u\in H^1(\Omega)}\int_\Omega\left(\rho_k(|\nabla u|) + \frac{\lambda}{2}|u - u_0|^2\right)dx \tag{19}$$

has a solution. This solution is unique when $\lambda > 0$.
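Proposition 2. The optimization problem (19) is equivalent to

$$\hat{u} = \arg\min_{(u,\theta)\in H^1(\Omega)\times\mathbb{R}}\int_\Omega\left(\frac{\theta^2}{2} + k\,\big||\nabla u| - \theta\big| + \frac{\lambda}{2}|u - u_0|^2\right)dx. \tag{20}$$

Proof. For $z$ fixed, define $\Psi(\theta) = \frac{1}{2}\theta^2 + k|z - \theta|$ on $\mathbb{R}$. It is clear that $\Psi$ is
convex on $\mathbb{R}$. It follows that $\Psi$ attains its minimum at some $\theta_0$: for $|z| > k$, the
conditions $\Psi'(\theta_0) = 0$ and $\Psi''(\theta_0) > 0$ give $\theta_0 = k\,\mathrm{sign}(z - \theta_0) = k\,\mathrm{sign}(z)$,
while for $|z| \le k$ the minimum is attained at the kink $\theta_0 = z$. Thus we have

$$\Psi(\theta_0) = \begin{cases} kz - \dfrac{k^2}{2} & \text{if } z > k \\[4pt] \dfrac{z^2}{2} & \text{if } |z| \le k \\[4pt] -kz - \dfrac{k^2}{2} & \text{if } z < -k. \end{cases}$$

It follows that $\rho_k(z) = \min_{\theta\in\mathbb{R}}\Psi(\theta)$. This concludes the proof. □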
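
The statement that the Huber cost is the minimum over $\theta$ of $\frac{1}{2}\theta^2 + k|z - \theta|$ can be checked by brute force. The sketch below minimizes over a fine grid of $\theta$ values for a few sample points $z$; the grid, the sample points and the value of `k` are arbitrary choices made for illustration.

```python
import numpy as np

def huber_rho(z, k):
    """Huber cost: quadratic for |z| <= k, linear beyond."""
    z = np.abs(z)
    return np.where(z <= k, 0.5 * z**2, k * z - 0.5 * k**2)

def psi(theta, z, k):
    """Auxiliary function 0.5*theta^2 + k*|z - theta| from the proof."""
    return 0.5 * theta**2 + k * np.abs(z - theta)

k = 1.5
theta = np.linspace(-5.0, 5.0, 200001)         # grid spacing 5e-5
zs = np.array([-3.0, -1.5, -0.4, 0.0, 0.7, 1.5, 4.0])

# Brute-force minimum over theta for each z, compared with the closed form.
mins = np.array([psi(theta, z, k).min() for z in zs])
print(np.max(np.abs(mins - huber_rho(zs, k))))
```

The grid minimum matches the closed-form Huber cost up to the grid resolution, both inside the quadratic zone and in the linear tails.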

Using the Euler-Lagrange variational principle, it follows that the Huber
gradient descent flow is given by

$$u_t = \nabla\cdot(g_k(|\nabla u|)\nabla u) - \lambda(u - u_0), \quad\text{in } \Omega\times\mathbb{R}^+, \tag{21}$$

where $g_k$ is the Huber M-estimator weight function [18]

$$g_k(z) = \begin{cases} 1 & \text{if } |z| \le k \\[4pt] \dfrac{k}{|z|} & \text{otherwise} \end{cases}$$

and with homogeneous Neumann boundary conditions.

For large $k$, the Huber gradient descent flow results in the isotropic diffusion
(heat equation when $\lambda = 0$), and for small $k$, it corresponds to total variation
gradient descent flow (curvature flow when $\lambda = 0$).

It is worth noting that in the case of no noise constraint (i.e. setting $\lambda = 0$),
the Huber gradient descent flow yields the robust anisotropic diffusion [19]
obtained by replacing the diffusion functions proposed in [6] by robust M-estimator
weight functions [18,1].
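
The two limiting regimes of the Huber weight function can be made concrete with a few lines of code. This is a minimal sketch; the function name `huber_weight` and the sample gradient magnitudes are assumptions made for illustration.

```python
import numpy as np

def huber_weight(z, k):
    """Huber weight: 1 for |z| <= k, k/|z| otherwise."""
    z = np.abs(np.asarray(z, dtype=float))
    # np.maximum(z, k) equals |z| wherever the second branch is selected and
    # avoids a division by zero in the branch that np.where discards.
    return np.where(z <= k, 1.0, k / np.maximum(z, k))

z = np.linspace(0.01, 5.0, 500)                # sample gradient magnitudes

w_large_k = huber_weight(z, k=100.0)           # every sample lies in the quadratic zone
w_small_k = huber_weight(z, k=1e-3)            # every sample lies in the linear zone
print(w_large_k.min(), w_large_k.max())
print(np.max(np.abs(w_small_k - 1e-3 / z)))
```

When $k$ exceeds all gradient magnitudes the weight is identically one, which is the heat equation; when $k$ is below all of them the weight is $k/|\nabla u|$, which is the total variation weight $1/|\nabla u|$ up to the constant factor $k$.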

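Chambolle and Lions [14] proposed the following variational integral

$$\Phi_\epsilon(u) = \int_\Omega \phi_\epsilon(|\nabla u|)\,dx,$$

where the Lagrangian $\phi_\epsilon$ is defined as

$$\phi_\epsilon(z) = \begin{cases} \dfrac{z^2}{2\epsilon} & \text{if } |z| \le \epsilon \\[4pt] |z| - \dfrac{\epsilon}{2} & \text{if } \epsilon < |z| \le \dfrac{1}{\epsilon} \\[4pt] \dfrac{\epsilon}{2}z^2 + \dfrac{1}{2}\left(\dfrac{1}{\epsilon} - \epsilon\right) & \text{if } |z| > \dfrac{1}{\epsilon}. \end{cases} \tag{22}$$

As mentioned in [14], the case $|z| > 1/\epsilon$ may be dropped when dealing with
discrete images since in this case the image gradients have bounded magnitudes.
Hence, the variational integral $\Phi_\epsilon$ becomes (up to the constant factor $1/\epsilon$) the
Huber variational integral $\mathcal{R}_\epsilon$.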
5 Entropic Variational Approach

The maximum entropy criterion is an important principle in statistics for
modelling the prior probability $p(u)$ of an unknown image $u$, and it has been used
with success in numerous image processing applications [4]. Suppose the available
information is given by way of moments of some known functions $m_r(u)$, $r = 1, \ldots, s$.
The maximum entropy principle suggests that a good choice of the prior
probability $p(u)$ is the one that has the maximum entropy or equivalently has the
minimum negentropy [20]

$$\begin{aligned} &\min_p \int p(u)\log p(u)\,du \\ &\text{s.t.}\quad \int p(u)\,du = 1, \\ &\phantom{\text{s.t.}\quad} \int m_r(u)\,p(u)\,du = \mu_r, \quad r = 1, \ldots, s. \end{aligned} \tag{23}$$

Using Lagrange’s theorem, the solution of (23) is given by

$$p(u) = \frac{1}{Z}\exp\left\{-\sum_{r=1}^{s}\lambda_r m_r(u)\right\}, \tag{24}$$

where the $\lambda_r$’s are the Lagrange multipliers, and $Z$ is the partition function. Thus,
the maximum entropy distribution $p(u)$ given by Eq. (24) can be used as a model
for the prior distribution in MAP estimation.



5.1 Entropic Gradient Descent Flow 

Motivated by the good performance of the maximum entropy method in the
probabilistic approach to image denoising, we define the negentropy variational
integral as

$$\mathcal{H}(u) = \int_\Omega H(|\nabla u|)\,dx = \int_\Omega |\nabla u|\log|\nabla u|\,dx, \tag{25}$$

where $H(z) = z\log(z)$, $z > 0$. Note that $H(z) \to 0$ as $z \to 0$.
It follows from the inequality $z\log(z) \le z^2/2$ that

$$|\mathcal{H}(u)| \le \int_\Omega |\nabla u|^2\,dx \le \|u\|^2_{H^1(\Omega)} < \infty, \quad \forall u \in H^1(\Omega).$$

Thus the negentropy variational integral $\mathcal{H} : H^1(\Omega) \to \mathbb{R}$ is well defined. Clearly,
the Lagrangian $H$ is strictly convex, and coercive, i.e. $H(z) \to \infty$ as $|z| \to \infty$.
The following result follows from Theorem 1.

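Proposition 3. Let $\lambda > 0$. The minimization problem

$$\hat{u} = \arg\min_{u\in H^1(\Omega)}\int_\Omega\left(|\nabla u|\log|\nabla u| + \frac{\lambda}{2}|u - u_0|^2\right)dx$$

has a unique solution provided that $|\nabla u| \ge 1$.

Using the Euler-Lagrange variational principle, it follows that the entropic
gradient descent flow is given by

$$u_t = \nabla\cdot\left(\frac{1 + \log|\nabla u|}{|\nabla u|}\nabla u\right) - \lambda(u - u_0), \quad\text{in } \Omega\times\mathbb{R}^+,$$

with homogeneous Neumann boundary conditions.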

Proposition 4. Let $u$ be an image. The negentropy variational integral and the
total variation satisfy the following inequality

$$\mathcal{H}(u) \ge TV(u) - 1.$$

Proof. Since the negentropy $H$ is a convex function, the Jensen inequality (with
the normalization $|\Omega| = 1$) yields

$$\int_\Omega H(|\nabla u|)\,dx \ge H\left(\int_\Omega |\nabla u|\,dx\right) = H(TV(u)) = TV(u)\log TV(u),$$

and using the inequality $z\log(z) \ge z - 1$ for $z > 0$, we conclude the proof. □
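
The chain of inequalities in the proof can be checked numerically on sampled gradient magnitudes. The sketch below is an illustration under assumptions: the gradient magnitudes are drawn at random rather than computed from an image, and integrals over the unit-area domain are approximated by pixel averages.

```python
import numpy as np

def H(z):
    """Negentropy integrand z*log(z), extended by 0 at z = 0."""
    z = np.asarray(z, dtype=float)
    safe = np.where(z > 0, z, 1.0)             # avoids log(0); that branch contributes 0
    return np.where(z > 0, z * np.log(safe), 0.0)

rng = np.random.default_rng(1)
grad_mag = rng.exponential(scale=2.0, size=(64, 64))   # stand-in for |grad u| per pixel

negentropy = float(H(grad_mag).mean())         # approximates the integral of H(|grad u|)
tv = float(grad_mag.mean())                    # approximates TV(u) on a unit-area domain
jensen_bound = float(H(tv))                    # H(TV(u)) = TV(u) log TV(u)
print(negentropy, jensen_bound, tv - 1.0)
```

The printed values are ordered as negentropy, then $H(TV(u))$, then $TV(u) - 1$, each no smaller than the next, as the proposition and its proof predict.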




Fig. 1. Visual comparison of some variational integrands. 
5.2 Improved Entropic Gradient Descent Flow

Some variational integrands discussed in this paper are plotted in Fig. 1. From
these plots, one may define a hybrid functional between the negentropy
variational integral and the total variation as follows

$$\widetilde{\mathcal{H}}(u) = \begin{cases} \mathcal{H}(u) & \text{if } |\nabla u| \le e \\ TV(u) & \text{otherwise.} \end{cases}$$

Note that the functional $\widetilde{\mathcal{H}}$ is not differentiable when the Euclidean norm of $\nabla u$
is equal to $e$ (i.e. Euler number: $e = \lim_{n\to\infty}(1 + 1/n)^n \approx 2.71$). This difficulty
is overcome if we replace $\widetilde{\mathcal{H}}$ with the following functional $\mathcal{H}_{TV}$ defined as

$$\mathcal{H}_{TV}(u) = \int_\Omega H_{TV}(|\nabla u|)\,dx = \begin{cases} \mathcal{H}(u) & \text{if } |\nabla u| \le e \\ 2\,TV(u) - |\Omega|e & \text{otherwise} \end{cases} \tag{27}$$

where $H_{TV} : \mathbb{R}^+ \to \mathbb{R}$ is defined as

$$H_{TV}(z) = \begin{cases} z\log(z) & \text{if } z \le e \\ 2z - e & \text{otherwise} \end{cases}$$

and $|\Omega|$ denotes the Lebesgue measure of the image domain $\Omega$. In the numerical
implementation of our algorithms, we may assume without loss of generality
that $\Omega = (0,1)\times(0,1)$, so that $|\Omega| = 1$. Note that $\mathcal{H}_{TV} : H^1(\Omega) \to \mathbb{R}$ is well
defined, differentiable, weakly lower semicontinuous, and coercive. It follows from
Theorem 1 that the minimization problem

$$\hat{u} = \arg\min_{u\in H^1(\Omega)}\left\{\mathcal{H}_{TV}(u) + \frac{\lambda}{2}\|u - u_0\|^2\right\} \tag{28}$$

has a solution.

Using the Euler-Lagrange variational principle, it follows that the improved
entropic gradient descent flow is given by

$$u_t = \nabla\cdot\left(\frac{H_{TV}'(|\nabla u|)}{|\nabla u|}\nabla u\right) - \lambda(u - u_0), \quad\text{in } \Omega\times\mathbb{R}^+, \tag{29}$$

with homogeneous Neumann boundary conditions.
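
The reason for using $2z - e$ rather than $z$ beyond the switching point can be seen by comparing values and slopes on both sides of $z = e$. The sketch below is a minimal illustration; the function names and the offset `h` are assumptions.

```python
import numpy as np

E = np.e                                       # switching point z = e

def _log_safe(z):
    return np.log(np.maximum(z, 1e-300))       # guards log(0) at z = 0

def H_tv(z):
    """Improved integrand: z*log(z) for z <= e, 2z - e beyond."""
    z = np.asarray(z, dtype=float)
    return np.where(z <= E, z * _log_safe(z), 2.0 * z - E)

def H_tv_prime(z):
    """Derivative of H_tv: 1 + log(z) for z <= e, 2 beyond."""
    z = np.asarray(z, dtype=float)
    return np.where(z <= E, 1.0 + _log_safe(z), 2.0)

def hybrid_prime(z):
    """Derivative of the non-smooth hybrid: z*log(z) below e, plain z above."""
    z = np.asarray(z, dtype=float)
    return np.where(z <= E, 1.0 + _log_safe(z), 1.0)

def improved_entropic_weight(z):
    """Diffusion weight g(z) = H_tv'(z)/z of the improved entropic flow."""
    return H_tv_prime(z) / np.asarray(z, dtype=float)

h = 1e-8
left, right = E - h, E + h
print(float(H_tv(left)), float(H_tv(right)))                   # values match at e
print(float(H_tv_prime(left)), float(H_tv_prime(right)))       # slopes match at e
print(float(hybrid_prime(left)), float(hybrid_prime(right)))   # slope jumps from 2 to 1
```

At $z = e$ both branches of $H_{TV}$ take the value $e$ and have slope 2, so the integrand is continuously differentiable, whereas the plain hybrid has a slope jump from 2 to 1 at the same point.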

6 Simulation Results 

This section presents simulation results where Huber, entropic, total variation 
and improved entropic gradient descent flows are applied to enhance images 
corrupted by Gaussian and Laplacian noise. 

The performance of a filter clearly depends on the filter type, the properties 
of signals/images, and the characteristics of the noise. The choice of criteria by 
which to measure the performance of a filter presents certain difficulties, and 
only gives a partial picture of reality. To assess the performance of the proposed 
denoising methods, a mean square error (MSE) between the filtered and the
original image is evaluated and used as a quantitative measure of performance of
the proposed techniques. The regularization parameter (or Lagrange multiplier)
$\lambda$ for the proposed gradient descent flows is chosen to be proportional to the
signal-to-noise ratio (SNR) in all the experiments.

In order to evaluate the performance of the proposed gradient descent flows
in the presence of Gaussian noise, the image shown in Fig. 2(a) has been
corrupted by Gaussian white noise with SNR = 4.79 dB. Fig. 2 displays the results
of filtering the noisy image shown in Fig. 2(b) by Huber with optimal k = 1.345,
entropic, total variation and improved entropic gradient descent flows.
Qualitatively, we observe that the proposed techniques are able to suppress
Gaussian noise while preserving important features in the image. The resulting mean
square error (MSE) computations are also tabulated in Fig. 2.

The Laplacian noise is somewhat heavier-tailed than the Gaussian noise. Moreover,
the Laplace distribution is similar to Huber’s least favorable distribution [9] (for
the no process noise case), at least in the tails. To demonstrate the application of
the proposed gradient descent flows to image denoising, qualitative and
quantitative comparisons are performed to show a much improved performance of these
techniques. Fig. 3(b) shows a noisy image contaminated by Laplacian white noise
with SNR = 3.91 dB. The MSE results obtained by applying the proposed
techniques to the noisy image are shown in Fig. 3 with the corresponding filtered
images. Note that the improved entropic gradient descent flow outperforms the
other flows in removing Laplacian noise. Comparison of these images clearly
indicates that the improved entropic gradient descent flow preserves the image
structures well while removing heavy tailed noise.

Fig. 2. Filtering results for Gaussian noise. Panels: (a) Original image; (b) Noisy image, SNR = 4.79 dB; (e) Total Variation flow; (f) Improved Entropic flow. MSE table rows: Entropic, Total Variation, Improved Entropic (values not recovered).

Fig. 3. Filtering results for Laplacian noise. Panels: (a) Original image; (b) Noisy image, SNR = 3.91 dB; (c) Huber flow; (d) Entropic flow; (e) Total Variation flow; (f) Improved Entropic flow.

Abstract. Motion segmentation methods often fail to detect the motions of
low textured regions. We develop an algorithm for the segmentation of low
textured moving objects. While current motion segmentation methods usually
use only two or three consecutive images, our method refines the shape of
the moving object by successively processing new frames as they become
available. We formulate the segmentation as a parameter estimation problem.
The images in the sequence are modeled taking into account the rigidity of
the moving object and the occlusion of the background by the moving object.
The segmentation algorithm is derived as a computationally simple
approximation to the Maximum Likelihood estimate of the parameters involved
in the image sequence model: the motions, the template of the moving object,
its intensity levels, and the intensity levels of the background pixels. We
describe experiments that demonstrate the good performance of our algorithm.



1 Introduction 

The segmentation of an image into regions that undergo different motions has 
received the attention of a large number of researchers. According to their re- 
search focus, different scientific communities addressed the motion segmentation 
task from distinct viewpoints. 

Several papers on image sequence coding address the motion segmentation 
task with computation time concerns. They reduce temporal redundancy by 
predicting each frame from the previous one through motion compensation. See 
reference [16] for a review on very low bit rate video coding. Regions undergo- 
ing different movements are compensated in different ways, according to their 
motion. The techniques used in image sequence coding attempt to segment the 
moving objects by processing only two consecutive frames. Since their focus is
on compression and not on developing a high level representation, these efforts
have not considered low textured scenes, and regions with no texture are
considered unchanged. As an example, we applied the algorithm of reference [8] to
segmenting a low textured moving object. Two consecutive frames of a traffic
road video clip are shown on the left side of Figure 1. On the right side of
Figure 1, the template of the moving car was found by excluding, from the regions
that changed between the two co-registered frames, the ones that correspond to
uncovered background areas, see reference [8]. The small regions that, due to
noise, are misclassified as belonging to the car template can be discarded by
adequate morphological post-processing. However, due to the low texture of the
car, the regions in the interior of the car are misclassified as belonging to the
background, leading to a highly incomplete car template.

Fig. 1. Motion segmentation in low texture.

High level representations in image sequence understanding, such as layered
models [14,18,3], have been considered in the computer vision literature. These
approaches to motion-based segmentation cope with low textured scenes by coupling
motion-based segmentation with prior knowledge about the scenes, as in
statistical regularization techniques, or by combining motion with other
attributes. For example, reference [7] uses a Markov Random Field (MRF) prior and
a Bayesian Maximum a Posteriori (MAP) criterion to segment moving regions. The
authors suggest a multiscale MRF model to resolve large regions of uniform
intensity. In reference [9], the contour of a moving object is estimated by fusing
motion with color segmentation and edge detection. In general, these methods lead
to complex and time consuming algorithms.

References [10,11] describe one of the few approaches using temporal integra- 
tion by averaging the images registered according to the motion of the different 
objects in the scene. After processing a number of frames, each of these inte- 
grated images is expected to show only one sharp region corresponding to the 
tracked object. This region is found by detecting the stationary regions between 
the corresponding integrated image and the current frame. Unless the back- 
ground is textured enough to blur completely the averaged images, some regions 
of the background can be classified as stationary. In this situation, their method 
overestimates the template of the moving object. This is particularly likely to 
happen when the background has large regions with almost constant color or 
intensity level. 



1.1 Proposed Approach 

We formulate image sequence analysis as a parameter estimation problem by
using the analogy between a communications system and image sequence analysis,
see references [15] and [17]. The segmentation algorithm is derived as a
computationally simple approximation to the Maximum Likelihood (ML) estimate of
the parameters involved in the two-dimensional (2D) image sequence model: the
motions, the template of the moving object, its intensity levels (the object
texture), and the intensity levels of the background pixels (the background
texture). The joint ML estimation of the complete set of parameters is a very
complex task. Motivated by our experience with real video sequences, we decouple
the estimation of the motions (moving objects and camera) from that of the
remaining parameters. The motions are estimated on a frame by frame basis. We
then introduce the motion estimates into the ML cost function and minimize this
function with respect to the remaining parameters.

The estimate of the object texture is obtained in closed form. To estimate the 
background texture and the moving object template, we develop a fast two-step 
iterative algorithm. The first step estimates the background for a fixed template 
- the solution is obtained in closed form. The second step estimates the template 
for a fixed background - the solution is given by a simple binary test evaluated 
at each pixel. The algorithm converges in a few iterations, typically three to
five.

Our approach is related to that of references [10,11]; however, we
model explicitly the occlusion of the background by the moving object and we 
use all the frames available rather than just a single frame to estimate the moving 
object template. Even when the moving object has a color very similar to the 
color of the background, our algorithm has the ability to resolve accurately the 
moving object from the background, because it integrates over time those small 
differences. 



1.2 Paper Organization 

In section 2, we state the segmentation problem. We define the notation, develop 
the observation model, and formulate the ML estimation. In section 3, we detail 
the two-step iterative method that minimizes the ML cost function. In section 4, 
we describe two experiments that demonstrate the performance of our algorithm. 
Section 5 concludes the paper. 

For the details not included in this paper, see reference [1]. A preliminary 
version of this work was presented in reference [2]. 

2 Problem Formulation 

We discuss motion segmentation in the context of Generative Video (GV), see 
references [13,14]. GV is a framework for the analysis and synthesis of video 
sequences. In GV the operational units are not the individual images in the 
original sequence, as in standard methods, but rather the world images and 
the ancillary data. The world images encode the non-redundant information 
about the video sequence. They are augmented views of the world - background 
world image - and complete views of moving objects - figure world images. The 
ancillary data registers the world images, stratifies them at each time instant,
and positions the camera with respect to the layering of world images. The world
images and the ancillary data are the GV representation, the information that 
is needed to regenerate the original video sequence. We formulate the moving 
object segmentation task as the problem of generating the world images and 
ancillary data for the GV representation of a video clip. 



2.1 Notation 

An image is a real function defined on a subset of the real plane. The image
space is a set $\{I : \mathcal{D} \to \mathcal{R}\}$, where $I$ is an image,
$\mathcal{D}$ is the domain of the image, and $\mathcal{R}$ is the range of the
image. The domain $\mathcal{D}$ is a compact subset of the real plane and the
range $\mathcal{R}$ is a subset of the real line $\mathbb{R}$. Examples of images
are the frame $f$ in the video sequence, denoted by $I_f$, the background world
image, denoted by $B$, the moving object world image, denoted by $O$, and the
moving object template, denoted by $T$. The images $I_f$, $B$, and $O$ have range
$\mathcal{R} = \mathbb{R}$. They code intensity gray levels¹. The template of the
moving object is a binary image, i.e., an image with range
$\mathcal{R} = \{0, 1\}$, defining the region occupied by the moving object. The
domain of the images $I_f$ and $T$ is a rectangle corresponding to the support of
the frames. The domain of the background world image $B$ is a subset
$\mathcal{D}$ of the plane whose shape and size depend on the camera motion,
i.e., $\mathcal{D}$ is the region of the background observed in the entire
sequence. The domain $\mathcal{D}$ of the moving object world image is the subset
of the plane where the template $T$ takes the value 1, i.e.,
$\mathcal{D} = \{(x, y) : T(x, y) = 1\}$.

In our implementation, the domain of each image is rectangular, with a size
fitting the needs of the corresponding image. Although we use a continuous
spatial dependence for convenience, in practice the domains are discretized and
the images are stored as matrices. We index the entries of each of these matrices
by the pixels $(x, y)$ of each image and refer to the value of image $I$ at pixel
$(x, y)$ as $I(x, y)$. Throughout the text, we refer to the image product of two
images $A$ and $B$, i.e., the image whose value at pixel $(x, y)$ equals
$A(x, y)B(x, y)$, as the image $AB$. Note that this product corresponds to the
Hadamard product, or elementwise product, of the matrices representing images $A$
and $B$, not their matrix product.

¹ The intensity values of the images in the video sequence are positive. In our
experiments, these values are coded by a binary word of eight bits. Thus, the
intensity values of a gray level image are in the set of integers in the interval
[0, 255]. For simplicity, we do not take into account the discretization and the
saturations, i.e., we consider the intensity values to be real numbers and the
gray level images to have range $\mathcal{R} = \mathbb{R}$. The analysis in the
thesis is easily extended to color images. A color is represented by specifying
three intensities, either of the perceptual attributes brightness, hue, and
saturation, or of the primary colors red, green, and blue, see reference [12].
The range of a color image is then $\mathcal{R} = \mathbb{R}^3$.

We consider two-dimensional (2D) parallel motions, i.e., all motions
(translations and rotations) are parallel to the camera plane. We represent this
kind of motion by specifying time varying position vectors. These vectors code
rotation-translation pairs that take values in the group of rigid transformations
of the plane, the special Euclidean group SE(2). The image obtained by applying
the rigid motion coded by the vector $p$ to the image $I$ is denoted by
$\mathcal{M}(p)I$. The image $\mathcal{M}(p)I$ is also usually called the
registration of the image $I$ according to the position vector $p$. The entity
represented by $\mathcal{M}(p)$ is seen as a motion operator. In practice, the
$(x, y)$ entry of the matrix representing the image $\mathcal{M}(p)I$ is given by
$\mathcal{M}(p)I(x, y) = I(f_x(p; x, y), f_y(p; x, y))$, where $f_x(p; x, y)$ and
$f_y(p; x, y)$ represent the coordinate transformation imposed by the 2D rigid
motion. We use bilinear interpolation to compute the intensity values at points
that fall in between the stored samples of an image.
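
The registration operator with bilinear interpolation can be sketched as follows. This is a minimal illustration, not the authors' implementation: the parameterization of the position vector as `(theta, tx, ty)`, the particular coordinate map used for $f_x$ and $f_y$ (rotation about the origin followed by translation), the zero fill outside the image support, and the names `bilinear` and `register` are all assumptions. Images are plain nested lists to keep the sketch self-contained.

```python
import math

def bilinear(img, x, y):
    """Sample img (list of rows, indexed img[y][x]) at a real-valued point.

    Returns 0.0 outside the image support (an assumption of this sketch).
    """
    h, w = len(img), len(img[0])
    if x < 0 or y < 0 or x > w - 1 or y > h - 1:
        return 0.0
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * img[y0][x0] + ax * img[y0][x1]
    bottom = (1 - ax) * img[y1][x0] + ax * img[y1][x1]
    return (1 - ay) * top + ay * bottom

def register(img, p):
    """Apply the motion operator: output(x, y) = img(f_x(p; x, y), f_y(p; x, y))."""
    theta, tx, ty = p
    c, s = math.cos(theta), math.sin(theta)
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            fx = c * x - s * y + tx   # assumed coordinate transformation f_x
            fy = s * x + c * y + ty   # assumed coordinate transformation f_y
            row.append(bilinear(img, fx, fy))
        out.append(row)
    return out

img = [[0, 10], [20, 30]]
print(bilinear(img, 0.5, 0.5))
print(register(img, (0.0, 1.0, 0.0)))
```

With the identity position vector `(0, 0, 0)` the image is returned unchanged, which is a convenient sanity check for any chosen convention.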

The motion operators can be composed. The registration of the image
$\mathcal{M}(p)I$ according to the position vector $q$ is denoted by
$\mathcal{M}(qp)I$. By doing this we are using the notation $qp$ for the
composition of the two elements of SE(2), $q$ and $p$. We denote the inverse of
$p$ by $p^{\#}$, i.e., the vector $p^{\#}$ is such that when composed with $p$ we
obtain the identity element of SE(2). Thus, the registration of the image
$\mathcal{M}(p)I$ according to the position vector $p^{\#}$ yields the original
image $I$, so we have
$\mathcal{M}(p^{\#})\mathcal{M}(p)I = \mathcal{M}(p^{\#}p)I = I$. Note that, in
general, the elements of SE(2) do not commute, i.e., we have $qp \neq pq$, and
$\mathcal{M}(qp)I \neq \mathcal{M}(pq)I$. Only in special cases is the
composition of the motion operators not affected by the order of application, as
for example when the motions $p$ and $q$ are pure translations or pure rotations.
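
The group structure described above can be checked numerically. The sketch below represents an element of SE(2) as a hypothetical triple `(theta, tx, ty)` acting on points as a rotation followed by a translation; this parameterization and the helper names `compose`, `inverse`, and `close` are assumptions made for illustration.

```python
import math

def compose(q, p):
    """Group product qp of two SE(2) elements given as (theta, tx, ty)."""
    tq, xq, yq = q
    tp, xp, yp = p
    c, s = math.cos(tq), math.sin(tq)
    return (tq + tp, c * xp - s * yp + xq, s * xp + c * yp + yq)

def inverse(p):
    """Element whose composition with p, in either order, is the identity."""
    t, x, y = p
    c, s = math.cos(t), math.sin(t)
    return (-t, -(c * x + s * y), -(-s * x + c * y))

def close(a, b, tol=1e-9):
    """Componentwise comparison with a small numerical tolerance."""
    return all(abs(u - v) < tol for u, v in zip(a, b))

rotation = (math.pi / 2, 0.0, 0.0)
translation = (0.0, 1.0, 0.0)

print(compose(translation, rotation))   # rotation applied first
print(compose(rotation, translation))   # translation applied first
print(close(compose(rotation, inverse(rotation)), (0.0, 0.0, 0.0)))
```

A rotation and a translation give different results depending on the order, while two pure translations (or two pure rotations about the origin) commute.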

The notation for the position vectors involved in the segmentation problem is
as follows. The vector $p_f$ represents the position of the background world image
relative to the camera in frame $f$. The vector $q_f$ represents the position of
the moving object relative to the camera in frame $f$.



2.2 Observation Model 



The observation model considers a scene with a moving object in front of a
moving camera with two-dimensional (2D) parallel motions. The pixel $(x, y)$ of
the image $I_f$ belongs either to the background world image $B$ or to the object
world image $O$. The intensity $I_f(x, y)$ of the pixel $(x, y)$ is modeled as

$$ I_f(x,y) = \mathcal{M}(p_f^{\#})B(x,y)\left[1 - \mathcal{M}(q_f^{\#})T(x,y)\right] + \mathcal{M}(q_f^{\#})O(x,y)\,\mathcal{M}(q_f^{\#})T(x,y) + W_f(x,y). \tag{1} $$



In equation (1), $T$ is the moving object template, $p_f$ and $q_f$ are the
camera pose and the object position, and $W_f$ stands for the observation noise,
assumed Gaussian, zero mean, and white.

Equation (1) states that the intensity of the pixel $(x, y)$ on frame $f$,
$I_f(x, y)$, is a noisy version of the true value of the intensity level of the
pixel $(x, y)$. If the pixel $(x, y)$ of the current image belongs to the template
of the object, $T$, after the template is compensated by the object position,
i.e., registered according to the vector $q_f^{\#}$, then
$\mathcal{M}(q_f^{\#})T(x, y) = 1$. In this case, the first term of the right
hand side of (1) is zero, while the second term equals
$\mathcal{M}(q_f^{\#})O(x, y)$, the intensity of the pixel $(x, y)$ of the moving
object. In other words, the intensity $I_f(x, y)$ equals the object intensity
$\mathcal{M}(q_f^{\#})O(x, y)$ corrupted by the noise $W_f(x, y)$. On the other
hand, if the pixel $(x, y)$ does not belong to the template of the object,
$\mathcal{M}(q_f^{\#})T(x, y) = 0$, and this pixel belongs to the background
world image $B$, registered according to the inverse of the camera position. In
this case, the intensity $I_f(x, y)$ is a noisy version of the background
intensity $\mathcal{M}(p_f^{\#})B(x, y)$. We want to emphasize that rather than
simply modeling the two different motions, as is usually done when processing
only two consecutive frames, expression (1) models the occlusion of the
background by the moving object explicitly.

Expression (1) is rewritten in compact form as

$$ I_f = \left\{ \mathcal{M}(p_f^{\#})B\left[\mathbf{1} - \mathcal{M}(q_f^{\#})T\right] + \mathcal{M}(q_f^{\#})O\,\mathcal{M}(q_f^{\#})T + W_f \right\} H, \tag{2} $$

where we assume that $I_f(x, y) = 0$ for $(x, y)$ outside the region observed by
the camera. This is taken care of in equation (2) by the binary image $H$ whose
$(x, y)$ entry is such that $H(x, y) = 1$ if pixel $(x, y)$ is in the observed
image $I_f$, and $H(x, y) = 0$ otherwise. The image $\mathbf{1}$ is constant with
value 1.
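
To make the observation model concrete, the sketch below synthesizes one frame under strong simplifying assumptions that are not in the text: the camera is static (so the background needs no registration and $H$ is 1 everywhere) and the object motion is an integer translation `q = (dx, dy)`. The helper names `shift` and `synthesize_frame` are hypothetical.

```python
import random

def shift(img, dx, dy, fill=0.0):
    """Translate image content by (dx, dy) pixels, filling uncovered pixels."""
    h, w = len(img), len(img[0])
    return [[img[y - dy][x - dx] if 0 <= y - dy < h and 0 <= x - dx < w else fill
             for x in range(w)] for y in range(h)]

def synthesize_frame(B, O, T, q, sigma=0.0, rng=None):
    """One frame of the model: B (1 - Tq) + Oq Tq + W, with all products elementwise.

    Tq and Oq are the template and object texture placed at position q.
    """
    rng = rng or random.Random(0)
    Tq, Oq = shift(T, *q), shift(O, *q)
    h, w = len(B), len(B[0])
    return [[B[y][x] * (1 - Tq[y][x]) + Oq[y][x] * Tq[y][x] + rng.gauss(0.0, sigma)
             for x in range(w)] for y in range(h)]

B = [[1, 1, 1]]          # background world image
O = [[9, 0, 0]]          # object world image
T = [[1, 0, 0]]          # object template
print(synthesize_frame(B, O, T, q=(1, 0)))
```

The object pixel replaces the background where the shifted template is 1, which is exactly the occlusion effect the model encodes.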



2.3 Maximum Likelihood Estimation 

Given $F$ frames $\{I_f, 1 \le f \le F\}$, we want to estimate the background
world image $B$, the object world image $O$, the object template $T$, the camera
poses $\{p_f, 1 \le f \le F\}$, and the object positions
$\{q_f, 1 \le f \le F\}$. The quantities $\{B, O, T, \{p_f\}, \{q_f\}\}$ define
the GV representation, the information that is needed to regenerate the original
video sequence.

Using the observation model of expression (2) and the Gaussian white noise
assumption, ML estimation leads to the minimization over all GV parameters
of the functional²

$$ C_2 = \iint \sum_{f=1}^{F} \left\{ I_f(x,y) - \mathcal{M}(p_f^{\#})B(x,y)\left[1 - \mathcal{M}(q_f^{\#})T(x,y)\right] - \mathcal{M}(q_f^{\#})O(x,y)\,\mathcal{M}(q_f^{\#})T(x,y) \right\}^2 H(x,y)\, dx\, dy, \tag{3} $$

where the inner sum is over the full set of $F$ frames and the outer integral is
over all pixels.

The estimation of the parameters of expression (2) using the $F$ frames rather
than a single pair of images is a distinguishing feature of our work. Other
techniques usually process only two or three consecutive frames. We use all
frames available as needed. The estimation of the parameters through the
minimization of a cost function that involves directly the image intensity values
is another distinguishing feature of our approach. Other methods try to make some
type of post-processing over incomplete template estimates. We process directly
the image intensity values, through ML estimation.

² We use a continuous spatial dependence for convenience. The variables $x$ and
$y$ are continuous while $f$ is discrete. In practice, the integral is
approximated by the sum over all the pixels.

The minimization of the functional $C_2$ in equation (3) with respect to the set
of GV constructs $\{B, O, T\}$ and to the motions
$\{\{p_f\}, \{q_f\}, 1 \le f \le F\}$ is a highly complex task. To obtain a
computationally feasible algorithm, we simplify the problem. We decouple the
estimation of the motions $\{\{p_f\}, \{q_f\}, 1 \le f \le F\}$ from the
determination of the GV constructs $\{B, O, T\}$. This is reasonable from a
practical point of view and is well supported by our experimental results with
real videos.

The rationale behind the simplification is that the motion of the object (and 
the motion of the background) can be inferred without having the knowledge 
of the exact object template. When only two or three frames are given, even 
humans find it much easier to infer the motions present in the scene than to 
recover an accurate template of the moving object. To better appreciate the
complexity of the problem, the reader can imagine an image sequence for which
there is no prior knowledge available, except that there is a background and an
occluding object that moves differently from the background. Since there are no
spatial cues, consider, for example, that the background texture and the object
texture are spatial white noise random variables. In this situation, humans can
easily infer the motion of the background and the motion of the object, even 
from only two consecutive frames. With respect to the template of the moving 
object, we are able to infer much more accurate templates if we are given a 
higher number of frames because in this case we easily capture the rigidity of 
the object across time. This observation motivated our approach of decoupling 
the estimation of the motions from the estimation of the remaining parameters. 

We perform the estimation of the motions on a frame by frame basis by using 
a known motion estimation method [5], see reference [1] for the details. After 
estimating the motions, we introduce the motion estimates into the ML cost 
function and minimize with respect to the remaining parameters. The solution 
provided by our algorithm is sub-optimal, in the sense that it is an approximation 
to the ML estimate of the entire set of parameters, and it can be seen as an initial 
guess for the minimizer of the ML cost function given by expression (3). Then, 
we can refine the estimate by using a greedy approach. We must emphasize,
however, that the key problem here is to find the initial guess expeditiously,
not the final refinement.

3 Minimization Procedure 

In this section, we assume that the motions have been correctly estimated and are 
known. We should note that, in reality, the motions are continuously estimated. 
Assuming the motions are known, the problem becomes the minimization of the 
ML cost function with respect to the remaining parameters, i.e., with respect 
to the template of the moving object, the texture of the moving object, and the 
texture of the background.

3.1 Two-Step Iterative Algorithm

Due to the special structure of the ML cost function $C_2$, we can express
explicitly and with no approximations involved the estimate $\hat{O}$ of the
object world image in terms of the template $T$. Doing this, we are left with the
minimization of $C_2$ with respect to the template $T$ and the background world
image $B$, still a nonlinear minimization. We approximate this minimization by a
two-step iterative algorithm: (i) in step one, we solve for the background $B$
while the template $T$ is kept fixed; and (ii) in step two, we solve for the
template $T$ while the background $B$ is kept fixed. We obtain closed-form
solutions for the minimizers in each of the steps (i) and (ii). The two steps are
repeated iteratively. The value of the ML cost function $C_2$ decreases along the
iterative process. The algorithm proceeds until every pixel has been assigned
unambiguously to either the moving object or the background.

To initialize the segmentation algorithm, we need an initial estimate of the 
background. A simple, often used, estimate for the background is the average
of the images in the sequence, with or without a robust statistical technique
such as outlier rejection, see for example reference [6]. The quality of this
background
estimate depends on the occlusion level of the background in the images pro- 
cessed. Depending on the particular characteristics of the image sequence, our 
algorithm can recover successfully the template of the moving object when using 
the average of the images as the initial estimate of the background. This is the 
case with the image sequence we use in the experiments reported in section 4. 
In reference [1], we propose a more elaborate initialization that leads to better 
initial estimates of the background. 
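
A minimal sketch of such an initialization is given below. It assumes the frames are already registered to the background (for instance, a static camera), and it uses the pixelwise temporal median as one possible robust alternative to the average; the median is an illustrative choice, not necessarily the outlier rejection scheme of reference [6]. The function name `initial_background` is hypothetical.

```python
from statistics import mean, median

def initial_background(frames, robust=False):
    """Pixelwise temporal average (or median, if robust=True) of co-registered frames."""
    h, w = len(frames[0]), len(frames[0][0])
    stat = median if robust else mean
    return [[stat([frame[y][x] for frame in frames]) for x in range(w)]
            for y in range(h)]

# Three 1x2 frames; the second pixel is occluded by a bright object in one frame.
frames = [[[10, 10]], [[10, 200]], [[10, 10]]]
print(initial_background(frames))
print(initial_background(frames, robust=True))
```

The average is pulled toward the occluding object in proportion to how often a pixel is occluded, which is why the quality of this estimate depends on the occlusion level.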



3.2 Estimation of the Moving Object World Image 

We express the estimate $\hat{O}$ of the moving object world image in terms of
the object template $T$. By minimizing $C_2$ with respect to the intensity value
$O(x, y)$, we obtain the average of the pixels that correspond to the point
$(x, y)$ of the object. The estimate $\hat{O}$ of the moving object world image
is then

$$ \hat{O} = T\,\frac{1}{F}\sum_{f=1}^{F} \mathcal{M}(q_f) I_f. \tag{4} $$

This compact expression averages the observations $I_f$ registered according to
the motion $q_f$ of the object in the region corresponding to the template $T$ of
the moving object.
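
Expression (4) can be sketched in a few lines under simplifying assumptions that are not in the text: object motions are integer translations `(dx, dy)`, so registering frame $f$ to the object coordinates is a shift by `(-dx, -dy)`, and pixels shifted in from outside the frame are filled with zero. The names `shift` and `estimate_object` are hypothetical.

```python
def shift(img, dx, dy, fill=0.0):
    """Translate image content by (dx, dy) pixels, filling uncovered pixels."""
    h, w = len(img), len(img[0])
    return [[img[y - dy][x - dx] if 0 <= y - dy < h and 0 <= x - dx < w else fill
             for x in range(w)] for y in range(h)]

def estimate_object(frames, positions, T):
    """Average of the frames registered to the object, masked by the template T."""
    F = len(frames)
    h, w = len(T), len(T[0])
    # Undo the object translation in each frame so the object pixels line up.
    registered = [shift(I, -dx, -dy) for I, (dx, dy) in zip(frames, positions)]
    return [[T[y][x] * sum(R[y][x] for R in registered) / F for x in range(w)]
            for y in range(h)]

T = [[1, 0, 0]]
frames = [[[8, 1, 1]], [[1, 10, 1]]]    # object seen at x = 0, then at x = 1
positions = [(0, 0), (1, 0)]
print(estimate_object(frames, positions, T))
```

The two noisy observations of the object pixel (8 and 10) are averaged, and everything outside the template is set to zero.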

We consider now separately the two steps of the iterative algorithm described 
above. 



3.3 Step (i): Estimation of the Background for Fixed Template 

To find the estimate $\hat{B}$ of the background world image, given the template
$T$, we register each term of the sum of the ML cost function $C_2$ in
equation (3) according to the position of the camera $p_f$ relative to the
background. This is a valid operation because $C_2$ is defined as a sum over all
the space $\{(x, y)\}$. We get

$$ C_2 = \iint \sum_{f=1}^{F} \left\{ \mathcal{M}(p_f)I_f - B\left[\mathbf{1} - \mathcal{M}(p_f q_f^{\#})T\right] - \mathcal{M}(p_f q_f^{\#})O\,\mathcal{M}(p_f q_f^{\#})T \right\}^2 \mathcal{M}(p_f)H\, dx\, dy. \tag{5} $$

Minimizing the ML cost function $C_2$ given by expression (5) with respect to the
intensity value $B(x, y)$, we get the estimate $\hat{B}(x, y)$ as the average of
the observed pixels that correspond to the pixel $(x, y)$ of the background. The
background world image estimate $\hat{B}$ is then written as

$$ \hat{B} = \frac{\sum_{f=1}^{F} \left[\mathbf{1} - \mathcal{M}(p_f q_f^{\#})T\right] \mathcal{M}(p_f)I_f}{\sum_{f=1}^{F} \left[\mathbf{1} - \mathcal{M}(p_f q_f^{\#})T\right] \mathcal{M}(p_f)H}. \tag{6} $$

The estimate $\hat{B}$ of the background world image in expression (6) is
the average of the observations $I_f$ registered according to the background
motion $p_f$, in the regions $\{(x, y)\}$ not occluded by the moving object,
i.e., when $\mathcal{M}(p_f q_f^{\#})T(x, y) = 0$. The term $\mathcal{M}(p_f)H$
provides the correct averaging normalization in the denominator by accounting
only for the pixels seen in the corresponding image.
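
Step (i) can be sketched as follows for the simplest case, which is an assumption of this example rather than the setting of the text: a static camera (identity $p_f$, $H$ equal to 1 everywhere) and integer object translations. Pixels occluded in every frame have a zero denominator; returning 0.0 there is an arbitrary choice of this sketch. The names `shift` and `estimate_background` are hypothetical.

```python
def shift(img, dx, dy, fill=0.0):
    """Translate image content by (dx, dy) pixels, filling uncovered pixels."""
    h, w = len(img), len(img[0])
    return [[img[y - dy][x - dx] if 0 <= y - dy < h and 0 <= x - dx < w else fill
             for x in range(w)] for y in range(h)]

def estimate_background(frames, positions, T):
    """Average each background pixel over the frames in which it is not occluded."""
    h, w = len(frames[0]), len(frames[0][0])
    # Template placed at the object position in each frame (1 = occluded pixel).
    masks = [shift(T, dx, dy) for dx, dy in positions]
    B = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            num = sum((1 - m[y][x]) * I[y][x] for m, I in zip(masks, frames))
            den = sum(1 - m[y][x] for m in masks)
            B[y][x] = num / den if den > 0 else 0.0   # never observed: arbitrary value
    return B

T = [[1, 0, 0]]
frames = [[[9, 5, 5]], [[5, 9, 5]]]     # background 5, object 9 moving right
positions = [(0, 0), (1, 0)]
print(estimate_background(frames, positions, T))
```

Each background pixel is recovered from the frames in which the object does not cover it, so the bright object value never leaks into the estimate.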

If we compare the moving object world image estimate $\hat{O}$ given by
equation (4) with the background world image estimate $\hat{B}$ in equation (6),
we see that $\hat{O}$ is linear in the template $T$, while $\hat{B}$ is nonlinear
in $T$. This has implications when estimating the template $T$ of the moving
object, as we see next.



3.4 Step (ii): Estimation of the Template for Fixed Background 

Let the background world image $B$ be given and replace the object world image
estimate $\hat{O}$ given by expression (4) in expression (3). The ML cost
function $C_2$ becomes linearly related to the object template $T$. Manipulating
$C_2$ as described next, we obtain

$$ C_2 = \iint T(x,y)\, Q(x,y)\, dx\, dy + \text{Constant}, \tag{7} $$

$$ Q(x,y) = Q_1(x,y) - Q_2(x,y), \tag{8} $$

$$ Q_1(x,y) = \frac{1}{F} \sum_{f=2}^{F} \sum_{g=1}^{f-1} \left[ \mathcal{M}(q_f)I_f(x,y) - \mathcal{M}(q_g)I_g(x,y) \right]^2, \tag{9} $$

$$ Q_2(x,y) = \sum_{f=1}^{F} \left[ \mathcal{M}(q_f)I_f(x,y) - \mathcal{M}(q_f p_f^{\#})B(x,y) \right]^2. \tag{10} $$

We call $Q$ the segmentation matrix.

Derivation of expressions (7) to (10) 

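Replace the estimate $\hat{O}$ of the moving object world image, given by
expression (4), in expression (3), to obtain

$$ C_2 = \iint \sum_{f=1}^{F} \left\{ I_f - \mathcal{M}(p_f^{\#})B\left[\mathbf{1} - \mathcal{M}(q_f^{\#})T\right] - \frac{1}{F}\sum_{g=1}^{F} \mathcal{M}(q_f^{\#}q_g)I_g\, \mathcal{M}(q_f^{\#})T \right\}^2 H\, dx\, dy. \tag{11} $$

Register each term of the sum according to the object position $q_f$. This is
valid because $C_2$ is defined as an integral over all the space $\{(x, y)\}$.
The result is

$$ C_2 = \iint \sum_{f=1}^{F} \left\{ \mathcal{M}(q_f)I_f - \mathcal{M}(q_f p_f^{\#})B + \left[ \mathcal{M}(q_f p_f^{\#})B - \frac{1}{F}\sum_{g=1}^{F} \mathcal{M}(q_g)I_g \right] T \right\}^2 \mathcal{M}(q_f)H\, dx\, dy. \tag{12} $$

In the remainder of the derivation, the spatial dependence is not important, and
we simplify the notation by omitting $(x, y)$. We rewrite the expression for
$C_2$ in compact form as

$$ C_2 = \iint C\, dx\, dy, \qquad C = \sum_{f=1}^{F} \left\{ \mathcal{I}_f - \mathcal{B}_f + \left[ \mathcal{B}_f - \frac{1}{F}\sum_{g=1}^{F} \mathcal{I}_g \right] T \right\}^2 \mathcal{H}_f, \tag{13} $$

$$ \mathcal{I}_f = \mathcal{M}(q_f)I_f(x,y), \quad \mathcal{B}_f = \mathcal{M}(q_f p_f^{\#})B(x,y), \quad \mathcal{H}_f = \mathcal{M}(q_f)H(x,y). \tag{14} $$

We need in the sequel the following equalities

$$ \left[\sum_{g=1}^{F} \mathcal{I}_g\right]^2 = \sum_{f=1}^{F}\sum_{g=1}^{F} \mathcal{I}_f \mathcal{I}_g \quad \text{and} \quad \sum_{f=2}^{F}\sum_{g=1}^{f-1} \left[\mathcal{I}_f^2 + \mathcal{I}_g^2\right] = (F-1)\sum_{g=1}^{F} \mathcal{I}_g^2. \tag{15} $$

Manipulating $C$ under the assumption that the moving object is completely
visible in the $F$ images ($T\mathcal{H}_f = T, \forall f$), and using the left
equality in (15), we obtain

$$ C = T \left\{ \sum_{f=1}^{F} \left[ 2\mathcal{I}_f\mathcal{B}_f - \mathcal{B}_f^2 \right] - \frac{1}{F}\left[\sum_{g=1}^{F} \mathcal{I}_g\right]^2 \right\} + \sum_{f=1}^{F} \left[\mathcal{I}_f - \mathcal{B}_f\right]^2 \mathcal{H}_f. \tag{16} $$

The second term of $C$ in expression (16) is independent of the template $T$. To
show that the sum that multiplies $T$ is the segmentation matrix $Q$ as defined
by expressions (8), (9), and (10), write $Q$ using the notation introduced
in (14):

$$ Q = \frac{1}{F}\sum_{f=2}^{F}\sum_{g=1}^{f-1} \left[\mathcal{I}_f^2 + \mathcal{I}_g^2 - 2\mathcal{I}_f\mathcal{I}_g\right] - \sum_{f=1}^{F} \left[\mathcal{I}_f^2 + \mathcal{B}_f^2 - 2\mathcal{I}_f\mathcal{B}_f\right]. \tag{17} $$

Manipulating this equation, using the two equalities in (15), we obtain

$$ Q = \sum_{f=1}^{F} \left[2\mathcal{I}_f\mathcal{B}_f - \mathcal{B}_f^2\right] - \frac{1}{F}\left[\sum_{g=1}^{F} \mathcal{I}_g^2 + 2\sum_{f=2}^{F}\sum_{g=1}^{f-1} \mathcal{I}_f\mathcal{I}_g\right]. \tag{18} $$

The following equality concludes the derivation:

$$ \left[\sum_{g=1}^{F} \mathcal{I}_g\right]^2 = \sum_{g=1}^{F} \mathcal{I}_g^2 + 2\sum_{f=2}^{F}\sum_{g=1}^{f-1} \mathcal{I}_f\mathcal{I}_g. \tag{19} $$

□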
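
The algebra of this derivation can be sanity-checked numerically at a single pixel. The sketch below draws arbitrary values for the registered frame and background intensities (the variable names and the random test data are hypothetical) and compares the pairwise form of the segmentation matrix in (17) with the factor that multiplies the template in (16).

```python
import random

rng = random.Random(0)
F = 6
I = [rng.uniform(0, 255) for _ in range(F)]   # registered frame values at one pixel
B = [rng.uniform(0, 255) for _ in range(F)]   # registered background values there

# Q as in (17): scaled pairwise frame differences minus frame/background differences.
q17 = (sum((I[f] - I[g]) ** 2 for f in range(1, F) for g in range(f)) / F
       - sum((I[f] - B[f]) ** 2 for f in range(F)))

# Factor multiplying the template in (16).
q16 = sum(2 * I[f] * B[f] - B[f] ** 2 for f in range(F)) - sum(I) ** 2 / F

# Right equality in (15): each squared value appears F - 1 times in the double sum.
lhs15 = sum(I[f] ** 2 + I[g] ** 2 for f in range(1, F) for g in range(f))
rhs15 = (F - 1) * sum(v * v for v in I)

print(abs(q17 - q16) < 1e-6, abs(lhs15 - rhs15) < 1e-6)
```

The agreement reflects the fact that the scaled sum of pairwise squared differences equals the sum of squared deviations of the registered frames from their average.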

We estimate the template $T$ by minimizing the ML cost function given by
expression (7) over the template $T$, given the background world image $B$. It is
clear from expression (7) that the minimization of $C_2$ with respect to each
spatial location of $T$ is independent of the minimization over the other
locations. The template $\hat{T}$ that minimizes the ML cost function $C_2$ is
given by the following test evaluated at each pixel:

$$ Q_1(x,y) \;\underset{\hat{T}(x,y)=1}{\overset{\hat{T}(x,y)=0}{\gtrless}}\; Q_2(x,y). \tag{20} $$

The estimate $\hat{T}$ of the template of the moving object in equation (20) is
obtained by checking which of two accumulated square differences is greater. In
the spatial locations where the accumulated differences between each frame
$\mathcal{M}(q_f)I_f$ and the background $\mathcal{M}(q_f p_f^{\#})B$ are greater
than the accumulated differences between each pair of co-registered frames
$\mathcal{M}(q_f)I_f$ and $\mathcal{M}(q_g)I_g$, we estimate
$\hat{T}(x, y) = 1$, meaning that these pixels belong to the moving object. If
not, the pixel is assigned to the background.

The reason why we did not replace the background world image estimate $\hat{B}$
given by (6) in (3), as we did with the object world image estimate $\hat{O}$, is
that it leads to an expression for $C_2$ in which the minimization with respect
to each spatial location $T(x, y)$ is not independent of the other locations.
Solving this binary minimization problem by a conventional method is extremely
time consuming. In contrast, the minimization of $C_2$ over $T$ for fixed $B$
results in a local binary test. This makes our solution computationally very
simple.

It may happen that, after processing the $F$ available frames, the test (20)
remains inconclusive at a given pixel $(x, y)$
($Q_1(x, y) \simeq Q_2(x, y)$); in other words, it is not possible to decide if
this pixel belongs to the moving object or to the background. We modify our
algorithm to address this ambiguity by defining the modified cost function

$$ C_{2\,\mathrm{MOD}} = C_2 + \alpha\,\mathrm{Area}(T) = C_2 + \alpha \iint T(x,y)\, dx\, dy, \tag{21} $$

where $C_2$ is as in equation (3), $\alpha$ is non-negative, and
$\mathrm{Area}(T)$ is the area of the template. Minimizing $C_{2\,\mathrm{MOD}}$
balances the agreement between the observations and the model (term $C_2$) with
minimizing the area of the template. Carrying out the minimization, first note
that the second term in expression (21) depends neither on $O$ nor on $B$, so we
get $\hat{O}_{\mathrm{MOD}} = \hat{O}$ and $\hat{B}_{\mathrm{MOD}} = \hat{B}$. By
replacing $\hat{O}$ in $C_{2\,\mathrm{MOD}}$, we get a modified version of
equation (7),

$$ C_{2\,\mathrm{MOD}} = \iint T(x,y)\left[Q(x,y) + \alpha\right] dx\, dy + \text{Constant}, \tag{22} $$

where $Q$ is defined in equations (8), (9), and (10). The template estimate is
now given by the following test, which extends test (20),

$$ Q(x,y) \;\underset{\hat{T}(x,y)=1}{\overset{\hat{T}(x,y)=0}{\gtrless}}\; -\alpha. \tag{23} $$

The parameter $\alpha$ may be chosen by experimentation, by using the Minimum
Description Length (MDL) principle, see reference [4], or made adaptive by an
annealing schedule, as in stochastic relaxation.
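
Step (ii) can be sketched as below, again under assumptions that are not in the text: a static camera, integer object translations, and no handling of the visibility image $H$ (pixels shifted in from outside the frame are simply filled with zero, which tends to push border pixels toward the background). The names `shift`, `segmentation_matrix`, and `estimate_template` are hypothetical.

```python
def shift(img, dx, dy, fill=0.0):
    """Translate image content by (dx, dy) pixels, filling uncovered pixels."""
    h, w = len(img), len(img[0])
    return [[img[y - dy][x - dx] if 0 <= y - dy < h and 0 <= x - dx < w else fill
             for x in range(w)] for y in range(h)]

def segmentation_matrix(frames, positions, B):
    """Q = Q1 - Q2, computed in the object coordinate frame."""
    F = len(frames)
    h, w = len(B), len(B[0])
    # Register frames and background to the object by undoing its translation.
    If = [shift(I, -dx, -dy) for I, (dx, dy) in zip(frames, positions)]
    Bf = [shift(B, -dx, -dy) for dx, dy in positions]
    Q = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            q1 = sum((If[f][y][x] - If[g][y][x]) ** 2
                     for f in range(1, F) for g in range(f)) / F
            q2 = sum((If[f][y][x] - Bf[f][y][x]) ** 2 for f in range(F))
            Q[y][x] = q1 - q2
    return Q

def estimate_template(Q, alpha=0.0):
    """Pixelwise test: assign a pixel to the object only where Q < -alpha."""
    return [[1 if q < -alpha else 0 for q in row] for row in Q]

B = [[5, 5, 5, 5]]
frames = [[[9, 5, 5, 5]], [[5, 9, 5, 5]], [[5, 5, 9, 5]]]   # object 9 moving right
positions = [(0, 0), (1, 0), (2, 0)]
Q = segmentation_matrix(frames, positions, B)
print(estimate_template(Q))
```

In this toy sequence only the first pixel is consistent across the co-registered frames while differing from the background, so it is the only one assigned to the object. Raising `alpha` makes the test more conservative and shrinks the template.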

4 Experiments 

We describe two experiments. The first one uses a challenging computer gen- 
erated image sequence to illustrate the convergence of the two-step iterative 
algorithm and its capability to segment complex shaped moving objects. The 
second experiment segments a real life traffic video clip. 



4.1 Synthetic Image Sequence 

We synthesized an image sequence according to the model described in section 2. 
Figure 2 shows the world images used. The left frame, from a real video, is 
the background world image. The moving object template is the logo of the
Instituto Superior Técnico (IST), which is transparent between the letters. Its
world image, shown in the right frame, is obtained by clipping with the IST
logo a portion of one of the frames in the sequence. The task of reconstructing
the object template is particularly challenging with this video sequence due to
the low contrast between the object and the background and the complexity of
the template. We synthesized a sequence of 20 images where the background is
static and the IST logo moves around.

Figure 3 shows three frames of the sequence obtained according to the image 
formation model introduced in section 2, expression (2), with noise variance
$\sigma^2 = 4$ (the intensity values are in the interval [0, 255]). The object moves from
the center (left frame) down by translational and rotational motion. It is difficult
to recognize the logo in the right frame because its texture is confused with the
texture of the background.

Fig. 2. Background and moving object.




Fig. 3. Three frames of the synthesized image sequence. 



Figure 4 illustrates the four iterations it took for the two-step estimation 
method of our algorithm to converge. The template estimate is initialized to zero 
(top left frame). Each background estimate in the right hand side was obtained 
using the template estimate on the left of it. Each template estimate was obtained 
using the previous background estimate. The arrows in Figure 4 indicate the flow 
of the algorithm. The good template estimate obtained, see bottom left image, 
illustrates that our algorithm can estimate complex templates in low contrast 
background. 

Note that this type of complex template (objects with transparent regions)
is much easier to describe by using a binary matrix than by using contour-based
descriptions, like splines, Fourier descriptors, or snakes. Our algorithm
overcomes the difficulty arising from the higher number of degrees of freedom of the
binary template by integrating over time the small intensity differences between
the background and the object. The two-step iterative algorithm performs this
integration expeditiously.



4.2 Road Traffic

In this experiment we use a road traffic video clip. The road traffic video sequence 
has 250 frames. Figure 5 shows frames 15, 166, and 225. The example given in 
section 1 to motivate the study of the segmentation of low textured scenes, see 
Figure 1, also uses frames 76 and 77 from the road traffic video clip. 

In this video sequence, the camera exhibits a pronounced panning motion, 
while four different cars enter and leave the scene. The cars and the background 
have regions of low texture. The intensity of some of the cars is very similar to 
the intensity of parts of the background. 




Maximum Likelihood Estimation of the Template of a Rigid Moving Object 







Fig. 4. Two-step iterative method: template estimates and background estimates. 





Figures 6 and 7 show the good results obtained after segmenting the sequence 
with our algorithm. Figure 7 displays the background world image, while Figure 6 
shows the world images of each of the moving cars. The estimates of the templates
for the cars in Figure 6 become unambiguous after 10, 10, and 14 frames,
respectively.



5 Conclusion 

We develop an algorithm for segmenting 2D rigid moving objects from an image 
sequence. Our method recovers the template of the 2D rigid moving object by 
processing directly the image intensity values. We model both the rigidity of the
moving object over a set of frames and the occlusion of the background by
the moving object.

Fig. 5. Traffic road video sequence. Frames 15, 166, and 225.




Fig. 6. Moving objects recovered from the traffic road video sequence. 




Fig. 7. Background world image recovered from the traffic road video sequence. 



We motivate our algorithm by looking for a feasible approximation to the ML 
estimation of the unknowns involved in the segmentation problem. Our method- 
ology introduces the 2D motion estimates into the ML cost function and uses 
a two-step iterative algorithm to approximate the minimization of the resultant 
cost function. The solutions to both steps turn out to be computationally very simple.
The two-step algorithm is computationally efficient because convergence is
achieved in a small number of iterations (typically three to five).

Our experiments show that the proposed algorithm can estimate complex
templates in low contrast scenes.

Abstract. The design of good features and good similarity measures
between features plays a central role in any retrieval system. The use of
metric similarities (i.e. coming from a real distance) is also very important
to allow fast retrieval on large databases. Moreover, these similarity
functions should be flexible enough to be tuned to fit users' behaviour.
These two constraints, flexibility and metricity, are generally difficult to
fulfil together. Our contribution is twofold. We show that the kernel approach
introduced by Vapnik can be used to generate metric similarities, especially
for the difficult case of planar shapes (invariant to rotation and
scaling). Moreover, we show that much more flexibility can be added by
non-rigid deformation of the induced feature space. Defining an adequate
Bayesian user model, we describe an estimation procedure based on the
maximisation of the underlying log-likelihood function.



1 Introduction 

The selection of good features from objects and the design of efficient similarity 
measures appear to play a central role for any retrieval algorithm for searching 
a database (see [17] for a recent survey on content-based retrieval). Moreover, 
similarity functions should come from true distances, hereafter called metric sim- 
ilarities, between objects to allow content-based databases using similarity-based 
retrieval to scale to large databases (several thousands up to several millions of 
objects) [21]. However, this last property appears to be a hard constraint for the
usual “home cooked” similarity functions, especially when dealing with objects
defined up to the action of some group of transformations (such as planar shapes).
These metric similarities should also ideally be seen as Euclidean distances
between points in some vector space. The usual way to achieve such a result is
to define some feature vector for each object and to identify the similarity with
a possibly weighted Euclidean distance or, more suitably, to first extract some
principal components from a PCA of the feature vectors.

More formally, a huge number of distances can be obtained by a simple
embedding procedure. Let $\phi : X \to Z$ be a one-to-one function from the object
space, the so-called image space, to some feature space equipped with a Hilbert
space structure. Then

$$ d_\phi(x, x') = |\phi(x) - \phi(x')|_Z \tag{1} $$

defines a distance on the object space. Now, defining the kernel
$k_\phi(x, x') = \langle \phi(x), \phi(x') \rangle_Z$, we get for the distance

$$ d_\phi(x, x') = \left( k_\phi(x, x) + k_\phi(x', x') - 2 k_\phi(x, x') \right)^{1/2}. \tag{2} $$
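
Formula (2) can be evaluated from the kernel alone, as the minimal sketch below illustrates. The two kernels are standard textbook examples chosen for the illustration, and all function names are hypothetical.

```python
import math


def kernel_distance(k, x, y):
    """Distance (2) induced by a kernel k, without computing the embedding."""
    squared = k(x, x) + k(y, y) - 2.0 * k(x, y)
    # Rounding can make the squared distance slightly negative when x is close to y.
    return math.sqrt(max(squared, 0.0))


def linear_kernel(x, y):
    """Kernel of the identity embedding: the usual dot product."""
    return sum(a * b for a, b in zip(x, y))


def gaussian_kernel(x, y, sigma=1.0):
    """Radial basis function kernel, one of the usual admissible kernels."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))


# With the linear kernel, (2) is the ordinary Euclidean distance.
print(kernel_distance(linear_kernel, (0.0, 0.0), (3.0, 4.0)))
print(kernel_distance(gaussian_kernel, (0.0, 0.0), (3.0, 4.0)))
```

Because the Gaussian kernel satisfies k(x, x) = 1, the induced distance is bounded by the square root of 2, which shows how the choice of kernel changes the geometry of the object space.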

The recent development of kernel methods is based essentially on the fact
(the so-called kernel trick) that instead of defining the embedding $\phi$
explicitly, it can be enough to start from a closed expression of an admissible kernel
(satisfying the famous Mercer conditions), so that all the computations can be
done without any explicit expression of $\phi$. This approach has been
followed successfully in statistical learning, through the so-called support vector
machines and their extensions [20,5], but also in data mining through nonlinear
principal component analysis [15]. When objects are planar shapes, this kernel
approach can be quite interesting to generate real distances between shapes. One
difficulty arises if rotation, scale and translation invariance is required. In [19],
we propose a construction of an invariant polynomial kernel for planar shapes.
The first contribution of this paper is an important extension of
the previous invariant kernel to a large family of kernels which can still be
computed in a very fast way.

However, concerning families of distance functions and the derivation of metric
similarities, although a quite large family of different kernels is known
(polynomial kernels, radial basis function kernels, etc.), all these families appear
to be parametrically low dimensional (often one or two parameters), allowing
in fact only restricted flexibility. Yet this flexibility is needed for at
least two main reasons. The first one is that simple retrieval processes based
on the retrieval of the k-nearest objects around the query example can be very
inefficient if the metric similarity does not fit the user's “implicit” similarity (if
it exists!). The second one is that efficient retrieval systems should provide some
feedback mechanism to improve the relevance of retrieved objects through
interaction with the user. This interaction should be built around a good user model
that predicts, for a given target and two displayed objects, which will be
chosen as the most similar and with what probability. The entropy reduction
scheme has been advocated by the PicHunter group in [4] and Geman et al. in
[7,8] and strongly depends, according to the terminology in [8], on the
“synchronisation” between the actual user behaviour and the predicted user behaviour
on a given target hypothesis. Now, it appears that some partial learning of the
metric similarities should be done in a sufficiently large set of metric similarities.

In some recent work [3], Chapelle et al. introduce some flexibility in the
kernel through scaling parameters between different features. Their basic idea
is to tune the variance parameters $\sigma_i$ according to some estimation of the test
error $T((\sigma_i))$ given by an SVM built on the kernel
$k_\sigma(x, x') = \exp(-\sum_i |x_i - x'_i|^2 / 2\sigma_i^2)$ (a similar version is
provided for polynomial kernels). In some sense,
independently of them, we have followed a common route, but with different
means and in a different framework. In this paper, we want to show how one
can improve a similarity measure, to a surprising extent, by non-rigid deformation
of a finite dimensional subspace of the feature space. Starting from the feature
mapping $\phi$ induced by a chosen “primary” kernel k, the basic idea is to try to learn
a perturbation $\tilde\phi = \phi + u(\phi)$ of the feature mapping by successive presentation
to the user of triplets of objects $x = (x_1, x_2, x_3)$. For each of them, the user is
asked to give the position y of the most similar pair, by mental comparison of the
individual similarity of the three possible pairs $\{x_2, x_3\}$ (y = 1), $\{x_1, x_3\}$ (y = 2)
and $\{x_1, x_2\}$ (y = 3). Hence, the user answers y = j if $\{x_{j+1}, x_{j+2}\}$ is the most
similar pair (the sums j + 1 and j + 2 are taken mod 3). Let $X_1^N = (X_1, \dots, X_N)$ be N
independently generated triplets and $Y_1^N = (Y_1, \dots, Y_N)$ be the corresponding
users' answers. We will use $(X_1^N, Y_1^N)$ as a learning set for the selection of $\tilde\phi$
through a Bayesian inference presented in section 3. If for any distance d and
any triplet X we denote

$$ \hat{Y}^{(d)} = \arg\min_j d(X_{j+1}, X_{j+2}), \tag{3} $$

the goal is to select the perturbation $\tilde\phi$ such that for the distance $d = d_{\tilde\phi}$ (see
(1)), the classification rule $\hat{Y}^{(d)}$ gives good generalisation performance on a new
triplet X.
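
The classification rule (3) amounts to comparing the three pairwise distances of a triplet. A minimal sketch follows; the function names and the scalar toy objects are hypothetical, and any distance function d on the object space could be passed in.

```python
def most_similar_pair(triplet, d):
    """Rule (3): return y in {1, 2, 3} minimising d(x_{y+1}, x_{y+2}), indices mod 3."""
    def pair_distance(y):
        # The 1-based items x_{y+1} and x_{y+2} (mod 3) are the 0-based
        # items y % 3 and (y + 1) % 3 of the triplet.
        return d(triplet[y % 3], triplet[(y + 1) % 3])
    return min((1, 2, 3), key=pair_distance)


def agreement_rate(triplets, answers, d):
    """Fraction of triplets on which rule (3) reproduces the recorded answer."""
    hits = sum(1 for x, y in zip(triplets, answers) if most_similar_pair(x, d) == y)
    return hits / len(triplets)


# Toy usage with real numbers as objects and |a - b| as distance.
def dist(a, b):
    return abs(a - b)

print(most_similar_pair((0.0, 1.0, 1.1), dist))
print(agreement_rate([(0.0, 1.0, 1.1), (0.0, 5.0, 0.2)], [1, 3], dist))
```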

The paper is organised as follows: in section 2, we present a new family of
invariant kernels for planar shapes. These kernels can be used as primary kernels
inducing different feature spaces Z. Then, in section 3, we present our Bayesian
inference framework and the derivation of the deformation process driven by a
variational formula. Finally, in section 4, we report some learning experiments
in the case of planar shapes.

2 Rotation and Shift Invariant Kernels for Planar Shapes 

2.1 Shape-Based Search 

We focus our experiments on planar shapes, since we believe that in such a case
keywords are not very useful for non-expert users (possibly speaking different
languages) to describe and select a shape in a retrieval process. Note that some
existing retrieval and indexing systems perform shape-based search, such as
IBM's QBIC [6], but still in a limited way (see [10] for an extended review). A
notable exception is the Shape Query Using Image Database (SQUID) prototype
developed by the University of Surrey, which relies entirely on shape similarities (but
without the usual functionalities of a real retrieval system). However, many issues
in that field are still open, such as the design of good similarity functions between
shapes and of efficient retrieval strategies for big databases. It is commonly
accepted that rotation, translation and scale invariance are highly recommended
for any shape similarity function. Moreover, as said in the introduction, metric
similarities based on real distances (satisfying the triangle inequality) are much
better suited to big databases since faster search procedures exist in that
case. Several such similarity functions are available from the literature, based on
Fourier descriptors [14], on curvature scale space [12], on modal representation
[16,9] and other features (see [10] and references therein). All of them appear
to have good reported features, but none of them is a real metric similarity. As
far as we know, the first real rotation and scale invariant metric similarity for
arbitrary shapes was proposed by Younes in [22] as a geodesic distance for
a special Riemannian structure on a shape manifold. However, the computation
time for this distance penalises its use for large databases. Another common
drawback of the previously reported distances is that they cannot be deduced
from any obvious kernel as defined in the introduction. Therefore, no principal
component analysis of the shape database can be performed (in this special case,
the use of kernel PCA is unavoidable if one wants an analysis that is invariant to
rotation, scale and even shift, i.e. the choice of the starting point in the curve
representation of a shape).



2.2 A New Family of Invariant Kernels for Planar Shapes 



We want to show in this section a systematic way to get a family of shift and
rotation invariant kernels. Note that we are interested in the case of closed curves
defined as functions $c : T \to \mathbb{R}^2$, where T is the one-dimensional torus $\mathbb{R}/\mathbb{Z}$.

Since we also want translation invariance, for any pair of curves c and c',
we will define $K(c, c') = k(f, g)$, where $f = \dot{c}$ and $g = \dot{c}'$ are the time derivatives
of c and c'. Moreover, since we are interested in the shapes of the curves and
not really in the actual parameterisation of these shapes, we will assume that
$|f| = 1$ and $|g| = 1$, i.e. that the curves are parametrised by arc-length. Hence,
we can focus on shift and rotation invariant kernels defined on the closed ball
B.

Definition 1. For any $\theta \in [0, 2\pi]$, we denote $r_\theta$ the rotation in $\mathbb{R}^2$ defined by

$$ r_\theta = \begin{pmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta) \end{pmatrix}. $$

We denote by F the set of all the functions from T to $\mathbb{R}^2$.

Definition 2. For any $u \in [0, 1]$ and $\theta \in [0, 2\pi]$ we define the shift $\tau_u : F \to F$
and $R_\theta : F \to F$ by $\tau_u(f)(x) = f(x + u)$ and $R_\theta(f)(x) = r_\theta(f(x))$.

Definition 3. We say that $k : B \times B \to \mathbb{R}$ is a shift and rotation invariant
kernel if

1. For any $u \in [0, 1]$, any $\theta \in [0, 2\pi]$ and any f and $g \in B$, we have
$k(R_\theta \circ \tau_u(f), g) = k(f, g)$.

2. For any finite family $(f_i)_{1 \le i \le n}$ of points in B, and any family $(\alpha_i)_{1 \le i \le n}$ of
real numbers, we have $\sum_{i,j} \alpha_i \alpha_j k(f_i, f_j) \ge 0$.

3. The kernel k is symmetric.

We can now state the main result of this section.

Theorem 1. Let $(a_n)_{n \in \mathbb{N}}$ be a sequence of non-negative real numbers
such that $\sum_n a_n < +\infty$. Let us define $\psi(x) = \sum_{n \ge 0} a_n x^n$ for any
$x \in [-1, 1]$ and define for any f and g in B

$$ k(f, g) = \int_{[0,1]} \psi(|\rho_{f,g}(u)|^2) \, du, $$

where $\rho_{f,g}(u) = \int_T f(x + u) \overline{g(x)} \, dx$ (here we identify $\mathbb{R}^2$ with $\mathbb{C}$, and $\overline{g}$ denotes
the complex conjugate). Then, k is a shift and rotation invariant kernel on B
and

$$ k(f, g) = \frac{1}{4\pi^2} \int_{[0,1]^2 \times [0,2\pi]^2} \varphi(\langle \tau_u \circ R_\theta(f), \tau_v \circ R_{\theta'}(g) \rangle) \, du \, dv \, d\theta \, d\theta', \tag{4} $$

where $\varphi(x) = \sum_{q \ge 0} \frac{a_q}{C_{2q}^q} x^{2q}$ and $\langle \cdot, \cdot \rangle$ denotes the usual dot product on
$L^2(T, \mathbb{R}^2)$.

Proof. First one can remark that, thanks to the Stirling formula, $\varphi$ is well defined
on $]-2, 2[$. Moreover, if we prove equality (4), then the shift and
rotation invariance is deduced immediately. By linearity, it is sufficient to prove
the equality for $\psi(x) = x^d$ for an arbitrary $d \in \mathbb{N}$, which is done by theorem
2 given in appendix A.

As a consequence, one can easily build a full range of kernels by choosing for instance

1. $\psi(x) = x^d$ for any positive integer d,

2. $\psi(x) = \exp(\gamma x)$ for any $\gamma > 0$,

3. $\psi(x) = 1/(1 - \gamma x)^d$ for any $\gamma \in \, ]0, 1[$ and any positive integer d.

Remark 1. Let us remark that if we start from a closed formula for $\psi$, the
computation of k can be achieved in an efficient way. Indeed, if we consider two
discrete versions $f_N$ and $g_N$ of f and g in N steps, given by $f_N(k) = f(k/N)$ and
$g_N(k) = g(k/N)$, we define

$$ k_N(f_N, g_N) = \frac{1}{N} \sum_{k=0}^{N-1} \psi(|\rho_{N,f,g}(k)|^2), $$

where $\rho_{N,f,g}(k) = \frac{1}{N} \sum_{j=0}^{N-1} f_N(j + k) \overline{g_N(j)}$. Hence, if we consider the Fourier
decomposition of $f_N$ and $g_N$, with coefficients $\hat{f}_{N,l}$ and $\hat{g}_{N,l}$ normalised by 1/N, we get

$$ \rho_{N,f,g}(k) = \sum_{l=0}^{N-1} \hat{f}_{N,l} \, \overline{\hat{g}_{N,l}} \, e^{2 i \pi k l / N}. $$

Hence, using the discrete FFT, the computation of $\rho_{N,f,g}$ can be achieved in $N \log(N)$
steps.

An interesting application is the case $\varphi(x) = (1 + x)^2$, which corresponds to
$\psi(x) = 1 + 2x$. In that case, we get

$$ k_N(f_N, g_N) = 1 + \frac{2}{N} \sum_{k=0}^{N-1} |\rho_{N,f,g}(k)|^2 = 1 + 2 \sum_{k=0}^{N-1} |\hat{f}_{N,k}|^2 |\hat{g}_{N,k}|^2. $$

In [19], we show that this kernel can be used to build an unsupervised hierarchical
clustering tree based on kernel PCA. Here this kernel will be used as a good
starting point to define a well suited feature space Z. The associated distance
will be denoted $d_c$. We report in fig. 1 a comparison of our kernel distance
for $\varphi(x) = (1 + x)^d$, d = 4, and the more involved approach using curvature
scale space developed in [12]. The remarkable fact is that we get comparable
results but with a simple real metric similarity, still rotation and shift invariant.
Moreover, in the feature space, this distance is just a usual Hilbert distance, so
we can tune it to fit users' behaviour as described now.
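
The computation outlined in Remark 1 can be sketched as follows, under stated assumptions: unit tangent vectors are stored as Python complex numbers (identifying the plane with the complex numbers), N is a power of two, and the small recursive FFT stands in for an optimised library routine. All function names are hypothetical.

```python
import cmath


def fft(a, inverse=False):
    """Recursive radix-2 FFT (unnormalised); len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft(a[0::2], inverse)
    odd = fft(a[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out


def correlation(f, g):
    """rho(k) = (1/N) sum_j f[(j + k) % N] * conj(g[j]) for every k, via FFT."""
    n = len(f)
    F = [c / n for c in fft(f)]  # Fourier coefficients normalised by 1/N
    G = [c / n for c in fft(g)]
    prod = [a * b.conjugate() for a, b in zip(F, G)]
    return fft(prod, inverse=True)


def shape_kernel(f, g, psi):
    """Discrete kernel k_N: average of psi(|rho(k)|^2) over all shifts k."""
    return sum(psi(abs(r) ** 2) for r in correlation(f, g)) / len(f)


def quadratic_shape_kernel(f, g):
    """Closed form for psi(x) = 1 + 2x: 1 + 2 * sum_l |F_l|^2 |G_l|^2."""
    n = len(f)
    F = [c / n for c in fft(f)]
    G = [c / n for c in fft(g)]
    return 1.0 + 2.0 * sum(abs(a) ** 2 * abs(b) ** 2 for a, b in zip(F, G))
```

Rotating f multiplies every $\rho_{N,f,g}(k)$ by a unit complex number, and shifting f permutes the values of k cyclically, so the value returned by `shape_kernel` is unchanged in both cases.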





Fig. 1. (a) (b) (c): three K-NN retrieval experiments (the query is the upper-left
shape) using the kernel distance for $\varphi(x) = (1 + x)^4$; (a') (b') (c'): corresponding
K-NN retrieval according to the CSS matching values from the SQUID demo site
http://www.ee.surrey.ac.uk/Research/VSSP/imagedb/demo.html



3 Deformation Process 

As developed in the introduction, we believe that there is no reason why the first
selected metric similarity should perfectly fit the implicit similarity underlying
users' decision processes. We agree with the discussion developed by Cox et al. in
[4] and Geman and Moquet in [8] about the need for psychophysical experiments
to improve similarity functions and ultimately the user model. We develop our
approach in this section. From now on, the space Z will denote the feature space
underlying the choice of the initial kernel, called the primary kernel.

Let E be a finite dimensional affine subspace of Z with direction $\vec{E}$ (for
any $z_0 \in E$, $E = z_0 + \vec{E}$) and $p_E$ be the orthogonal projection on E. For any
displacement field $u : E \to \vec{E}$, we can define

$$ \phi_u(x) = \phi(x) + u(p_E(\phi(x))). \tag{5} $$

3.1 Probabilistic Model 

We assume that users' answers are driven by the choice of a random U drawn
according to a Gaussian model and that, given U, the different answers are
independent, so that

$$ P(Y_1^N \mid X_1^N, U) = \prod_{i=1}^{N} P(Y_i \mid X_i, U). \tag{6} $$

For a given choice of U and a given experiment X, the answer Y of the user is
assumed to follow the probabilistic model

$$ P(Y = j \mid X, U) \propto \exp(-\gamma \, d_{\phi_U}(X_{j+1}, X_{j+2})^2). \tag{7} $$
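
A minimal sketch of the answer model (7) and of the log-likelihood obtained from (6) is given below. The exponent applied to the distance is exposed as a parameter `power` because it is an assumption of this sketch (set to 2 by default); the function names are hypothetical, and d stands for the distance induced by the current deformation.

```python
import math


def log_answer_probabilities(triplet, d, gamma=1.0, power=2):
    """Log of P(Y = j | X, U) for j = 1, 2, 3, following model (7)."""
    energies = [gamma * d(triplet[j % 3], triplet[(j + 1) % 3]) ** power
                for j in (1, 2, 3)]
    # Subtract the smallest energy before exponentiating, for numerical stability.
    m = min(energies)
    log_total = math.log(sum(math.exp(-(e - m)) for e in energies))
    return [-(e - m) - log_total for e in energies]


def answer_probabilities(triplet, d, gamma=1.0, power=2):
    """Normalised probabilities of the three possible answers."""
    return [math.exp(v) for v in log_answer_probabilities(triplet, d, gamma, power)]


def log_likelihood(triplets, answers, d, gamma=1.0, power=2):
    """Log of (6): sum over experiments of log P(Y_i | X_i, U)."""
    return sum(log_answer_probabilities(x, d, gamma, power)[y - 1]
               for x, y in zip(triplets, answers))
```

The most probable answer under (7) is always the pair with the smallest distance, and a larger $\gamma$ concentrates the probability mass on that pair.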

In [7,8], this kind of Bayesian model has been proposed but with some important
differences. On one hand, they assume that the user selects a distance according
to a law which could also depend on the triplet itself, which is a more general
model. On the other hand, the model for the choice of the distance is simpler
than ours, and based on the random selection of a point in the convex set of
mixtures of a small set of predefined distances.

Now, the Bayesian estimator for the answer Y to a new experiment X is
given by $\hat{Y} = \arg\max_j P(Y = j \mid X_1^N, Y_1^N, X)$. Denoting $Z_1^N = (X_1^N, Y_1^N)$, we
deduce that

$$ P(Y = j \mid Z_1^N, X) = E(P(Y = j \mid U, X) \mid Z_1^N, X), $$

so that we get for X = x

$$ \hat{Y} = \arg\max_j E(\varphi_x(j, U) \mid X_1^N, Y_1^N) \tag{8} $$

with $\varphi_x(j, U) = P(Y = j \mid X = x, U)$. The integration (8) over the posterior
law of U given the observations $X_1^N, Y_1^N$ could be computed by Monte Carlo
simulations [2]. However, this approach may lead to prohibitive CPU time, so
we focus on a simpler approximation in which the a posteriori law is replaced
by the Dirac measure $\delta_{U^*}$, where $U^*$ is the maximum a posteriori of U [11]:

$$ U^* = \arg\max_u P(U = u \mid X_1^N, Y_1^N). \tag{9} $$

Note here that the previous definition of $U^*$ is ambiguous in this infinite
dimensional setting. It should be understood in a limit sense from finite dimensional
approximations [13]. Finally, the Bayesian decision is approximated by the
maximisation in j of $P(Y = j \mid X, U^*)$, so that using (7), we get our approximate
estimator

$$ Y^* = \arg\min_j d_{\phi_{U^*}}(X_{j+1}, X_{j+2}). \tag{10} $$



3.2 Variational Formulation

From equation (9), since the prior on U is Gaussian, if V is its reproducing
kernel Hilbert space with norm $\| \cdot \|_V$ (which can be understood through the formal
expression $dP(U = u) \propto \exp(-\frac{1}{2} \|u\|_V^2) \, du$), then $U^*$ is defined as the element
$u \in V$ achieving the maximum value of

$$ W(u) = -\frac{1}{2} \|u\|_V^2 + \sum_{i=1}^{N} \log(P(Y_i \mid X_i, U = u)). \tag{11} $$

Here again, equation (11) cannot be derived rigorously from the Bayesian
formulation, but this could be done in a limit sense on finite dimensional
approximations [13]. Another way is to start directly from (11) as a variational formulation
of the problem.

To specify the Gaussian prior, we choose an orthonormal basis $(e_1, \dots, e_p)$
of $\vec{E}$, such that $U(z) = \sum_{k=1}^{p} U^k(z) e_k$, and we assume that the coordinates are
independent and that the covariance structure is given by

$$ E(U^k(z) U^{k'}(z')) = 1_{k=k'} \, g_{\sigma^2}(z - z') $$

for any $z, z' \in E$, where $g_{\sigma^2}(z) = \exp(-|z|^2 / 2\sigma^2) / (2\pi\sigma^2)^{p/2}$ is defined from
$\vec{E}$ to $\mathbb{R}$ (other covariance structures could be used, but this one induces smooth
random fields and depends on a single parameter $\sigma$). Then the associated
reproducing kernel Hilbert space is defined by (* is the convolution)

$$ V = \left\{ u(z) = \sum_{k=1}^{p} (g_{\sigma^2/2} * h^k)(z) \, e_k \;\middle|\; h^k \in L^2(\vec{E}, \mathbb{R}, dz) \right\} $$

and $\|u\|_V^2 = \sum_{k=1}^{p} \int |h^k(z)|^2 \, dz$. To perform the computation, we use a finite
element approximation of the space V,

$$ V_L = \left\{ u(z) = \sum_{k,l} \alpha_l^k \, g_{\sigma^2}(t_l - z) \, e_k \;\middle|\; \alpha_l^k \in \mathbb{R} \right\}, $$

where $(t_l)_{1 \le l \le L}$ is a family of control points chosen in E. We have $V_L \subset V$ and

$$ \|u\|_V^2 = \sum_{l, l', k} \alpha_l^k \alpha_{l'}^k \, g_{\sigma^2}(t_l - t_{l'}) $$

for $u \in V_L$. The problem of optimising W on $V_L$ is now an optimisation
problem in dimension $L \times \dim(\vec{E})$, for which we use gradient descent. The
existence of a maximizer for W in V could be proved easily by showing the
continuity of $u \mapsto \log(P(Y_i \mid X_i, U = u))$ for the weak topology on V, but in finite
dimension this existence is straightforward. Importantly, the gradient should be
computed with respect to the dot product on V, the so-called “natural gradient”
[1,18]. This allows in particular a good stability of the solution with respect to
the number of control points.
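
The finite element parameterisation can be illustrated by the following minimal sketch. It assumes that feature vectors are given by their coordinates along the principal axes, centred at the mean, so that the projection on E is simply the first p coordinates; this coordinate convention and all function names are assumptions of the example.

```python
import math


def gauss(z, sigma2):
    """Isotropic Gaussian density with variance sigma2 on R^p, p = len(z)."""
    p = len(z)
    return math.exp(-sum(c * c for c in z) / (2.0 * sigma2)) / (2.0 * math.pi * sigma2) ** (p / 2)


def displacement(z, controls, alpha, sigma2):
    """u(z) = sum_l alpha[l] * g(t_l - z); alpha[l] holds the p coefficients of control point l."""
    u = [0.0] * len(z)
    for t, a in zip(controls, alpha):
        w = gauss([ti - zi for ti, zi in zip(t, z)], sigma2)
        for k in range(len(z)):
            u[k] += a[k] * w
    return u


def rkhs_sq_norm(controls, alpha, sigma2):
    """Squared norm on V_L: sum over l, l', k of alpha_l^k alpha_l'^k g(t_l - t_l')."""
    total = 0.0
    for t, a in zip(controls, alpha):
        for t2, a2 in zip(controls, alpha):
            w = gauss([x - y for x, y in zip(t, t2)], sigma2)
            total += w * sum(x * y for x, y in zip(a, a2))
    return total


def deform(z, p, controls, alpha, sigma2):
    """Deformed features as in (5): only the first p coordinates are displaced."""
    u = displacement(z[:p], controls, alpha, sigma2)
    return [z[k] + u[k] for k in range(p)] + list(z[p:])
```

The objective W is then the log-likelihood of the recorded answers, computed with the Euclidean distance between deformed features, minus half of `rkhs_sq_norm`; the gradient step itself is not shown here.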

4 Experiments 

We have performed our experiments on two databases, the LaTeX database¹ and
the African database². We fix a random sequence of 220 triplets of shapes. Five
members of the L2TI laboratory were asked to decide the most similar
pair among the three possible pairs, as described previously, using a friendly GUI.
The same sequence of triplets is presented to each user.

Fig. 2. Examples of triplets presented to the users from the LaTeX database

In many cases, for example when the three curves seem to have nothing in common, different
users make different choices. We denote $Y_i^k$ the answer of the kth user to the
ith experiment in the sequence. The sequence is then split into two subsets $\mathcal{L}$
(learning set) and $\mathcal{T}$ (test set). We start from the primary kernel $k_c$ presented
in section 2 (with $\varphi(x) = (1 + x)^d$), and we keep only the first 10 nonlinear
principal components (we have checked that the remaining squared variation
on the orthogonal space is negligible), dealing with a reduced feature space Z of
dimension 10. Now, we choose for E the affine subspace going through the mean
value $\bar{z}$ of the feature vectors in the database and with direction $\vec{E}$ given by the
first p principal (nonlinear) axes (the reported results have been obtained with
p = 3 and 60 control points). We use the learning set to learn a deformation
as defined in the previous section, and we get a new distance $d^* = d_{\phi_{U^*}}$ (see (1)
and (5)). On the test set, we compare the prediction $\hat{Y}_i^{(d)}$ (see (3)) for a given

¹ This database contains outlines of randomly deformed LaTeX symbols.
² This database contains outlines of homogeneous ecological regions from the inner
delta of the River Niger in Mali (14000 different shapes).







Metric Similarities Learning through Examples 






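distance d with the answers $Y_i^k$ given by the users, leading to the empirical
score

$$ J_d = \frac{1}{5 |\mathcal{T}|} \sum_{k=1}^{5} \sum_{i \in \mathcal{T}} 1_{\hat{Y}_i^{(d)} = Y_i^k}. \tag{12} $$

For comparison with the human performance, we compute the mean cross-prediction
score on the test set between the different humans:

$$ J_{\mathrm{hum}} = \frac{1}{20 |\mathcal{T}|} \sum_{k \ne k'} \sum_{i \in \mathcal{T}} 1_{Y_i^k = Y_i^{k'}}. \tag{13} $$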
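
Both scores are simple agreement rates, as the following minimal sketch shows. It assumes that `answers[k][i]` stores the answer of user k on test triplet i and that `predictions[i]` is the output of rule (3) for the distance under study; these names and the data layout are hypothetical.

```python
def empirical_score(predictions, answers):
    """J_d: mean agreement between the predictions and the answers of each user."""
    n_users = len(answers)
    n_tests = len(predictions)
    hits = sum(1 for user in answers
               for p, y in zip(predictions, user) if p == y)
    return hits / (n_users * n_tests)


def human_score(answers):
    """J_hum: mean agreement over all ordered pairs of distinct users."""
    n_users = len(answers)
    n_tests = len(answers[0])
    hits = sum(1
               for k in range(n_users)
               for k2 in range(n_users) if k != k2
               for a, b in zip(answers[k], answers[k2]) if a == b)
    return hits / (n_users * (n_users - 1) * n_tests)


# Toy usage with two users and three test triplets.
toy_answers = [[1, 2, 3], [1, 2, 1]]
print(empirical_score([1, 2, 3], toy_answers))
print(human_score(toy_answers))
```

With five users there are 20 ordered pairs of distinct users, which is where the normalising factor in (13) comes from.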



To see the improvement brought by the “learning” process in the selection of the
distance, we propose comparisons with various “home cooked” (or not)
distances. The first distance is the usual “$L^2$” distance defined by
$d_1(c, c')^2 = \inf_{\theta, s} \int |e^{i\theta} c(u + s) - c'(u)|^2 \, du$. The second one is a geodesic distance between
curves defined in [22]. The third one is the distance $d_c$ deduced from the primary
kernel $k_c$ (see (2)). For all the presented results, the size of the test set is $|\mathcal{T}| = 100$
and the size of the learning set is $|\mathcal{L}| = 120$. The following table is the output of
the experiment.



Table 1. First column: the different distances used for prediction; second column:
LaTeX database; third column: African database.

Database                LaTeX (J_hum = 77.5%)      African (J_hum = 70%)
Distances d             J_d       J_d / J_hum      J_d       J_d / J_hum
L² distance             57.2%     73.8%            48.8%     71.4%
geodesic distance       59.2%     76.4%            50%       71.43%
kernel distance d_c     58.6%     75.6%            59%       84.4%
deformed distance d*    77.2%     99.6%            70%       100%



Concerning the experiments on the LaTeX database, one remarkable fact is
that the first three distances have very close scores, about 75% of the human
performance. We also tested other kernel distances through optimisation of the
coefficients in the $\psi$ expansion, without getting really important improvements.
After the learned deformation, the performance is very close to the human
one. We think that the gap between 75% and 99.6% could not be filled by any
existing parametric family of kernels. However, our approach basically needs only
two continuous parameters, $\sigma$ and $\lambda$, to be tuned (in the reported results $\sigma = 0.3$ and
$\lambda = 20$). The same comments apply to the African database.

We display in figure 3 the effect of the deformation on the representation in
the feature space of the shapes involved in the experiment (we display only the
first two components). We can see a strong effect of the deformation process. To
visualise it better, we display in figure 4 the 3D deformation of the
2D subspace corresponding to the first two principal axes.

Fig. 3. Distribution of the first two principal components: (a) directly from the KPCA
projection with kernel $k_c$, (b) after the “learning” process (LaTeX database),
$\lambda = 20$, $\sigma = 0.3$, 60 control points.

Fig. 4. Effect of the deformation on a 2D subspace of the feature space (LaTeX database)



5 Conclusion 

The previous experiments show that we can efficiently tune a designed primary
kernel to fit human behaviour. Our feeling is that the primary
kernel should be chosen so as to embed in the feature space all the invariance
properties we need (such as the rotation and scale invariance in our case). However,
one cannot expect fine tuning without feedback through examples. The
introduction of non-rigid deformation driven by a well-chosen subspace of the feature
space gives the necessary flexibility. Moreover, the Bayesian framework we have
developed shows that the information given by the experiments with the users
can be used to define good deformations. The complexity of the learning
algorithm remains low (the learning time does not exceed a few seconds on a
standard PC). Such an approach could be used in many contexts and allows one to define
fitted similarity measures which are still distances. This is an important fact for
speed and efficiency in all subsequent steps of a retrieval process.

A Appendix 

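In this appendix, we prove a preliminary result (Theorem 2) which is the core
result for the proof of theorem 1. The central idea is to extend the framework
to functions with values in $\mathbb{C}^2$.

Let us consider the 2-dimensional space $\mathbb{C}^2$ and denote by $e = (e_1, e_2)$ the canonical
basis. Note that the rotation $r_\theta$ can be extended straightforwardly to $\mathbb{C}^2$ (with
the same matrix representation in the new canonical basis), as well as the
definitions of $R_\theta$ and $\tau_u$. We define on $\mathbb{C}^2$ the usual hermitian product, defined for any
$z = (z_1, z_2)$ and $z' = (z'_1, z'_2)$ in $\mathbb{C}^2$ by $\langle z, z' \rangle_{\mathbb{C}^2} = z_1 \overline{z'_1} + z_2 \overline{z'_2}$. We define now the
new orthonormal basis $t = (t_1, t_2)$ by $t_1 = (e_1 + i e_2)/\sqrt{2}$ and $t_2 = (e_1 - i e_2)/\sqrt{2}$.
Moreover, as usual, for any f and g in $L^2(T, \mathbb{C}^2)$, we denote $\langle f, g \rangle$ the hermitian
product on $L^2(T, \mathbb{C}^2)$ defined by $\langle f, g \rangle = \int_T \langle f(x), g(x) \rangle_{\mathbb{C}^2} \, dx$.

Let $d \in \mathbb{N}$. We consider for any functions $f, g \in L^2(T, \mathbb{C}^2)$,

$$ k(f, g) = \frac{1}{4\pi^2} \int_{[0,1]^2 \times [0,2\pi]^2} \langle \tau_u \circ R_\theta(f), \tau_v \circ R_{\theta'}(g) \rangle^d \, du \, dv \, d\theta \, d\theta'. $$

For any $f \in L^2(T, \mathbb{C}^2)$, we define $f_1$ and $f_2$ in $L^2(T, \mathbb{C})$ by $f_1(x) = \langle f(x), t_1 \rangle_{\mathbb{C}^2}$
and $f_2(x) = \langle f(x), t_2 \rangle_{\mathbb{C}^2}$, so that $f = f_1 t_1 + f_2 t_2$. In the sequel, we identify $\mathbb{R}^2$
with the subspace $\mathbb{R} e_1 \oplus \mathbb{R} e_2$.

Lemma 1. Assume that d = 2q is even. Then, for any $f, g \in L^2(T, \mathbb{R}^2)$ we have
$k(f, g) = \int_{[0,1]} C_{2q}^q \left| \int_T \tau_u f_2(x) \overline{g_2(x)} \, dx \right|^{2q} du$. If d is odd, then $k(f, g) = 0$.

Proof. First, we notice that for any $\theta \in [0, 2\pi]$, $r_\theta(t_1) = e^{-i\theta} t_1$ and
$r_\theta(t_2) = e^{i\theta} t_2$, so that $R_\theta(f) = e^{-i\theta} f_1 t_1 + e^{i\theta} f_2 t_2$. Hence, one gets that

$$ \langle R_\theta(f), g \rangle = \int_T e^{-i\theta} f_1 \overline{g_1}(x) + e^{i\theta} f_2 \overline{g_2}(x) \, dx, $$

so that, raising to the power of d and integrating over $\theta \in [0, 2\pi]$, we obtain finally

$$ \frac{1}{2\pi} \int_{[0,2\pi]} \langle R_\theta(f), g \rangle^d \, d\theta = \frac{1}{2\pi} \sum_{k=0}^{d} C_d^k \left( \int_T f_1 \overline{g_1} \, dx \right)^{k} \left( \int_T f_2 \overline{g_2} \, dx \right)^{d-k} \int_{[0,2\pi]} e^{i(d-2k)\theta} \, d\theta. $$

Using the fact that $\int_{[0,2\pi]} e^{i(d-2k)\theta} \, d\theta = 0$ if $d \ne 2k$, we
get the result for d odd. If d = 2q then we have

$$ \frac{1}{2\pi} \int_{[0,2\pi]} \langle R_\theta(f), g \rangle^d \, d\theta = C_{2q}^q \left( \int_T f_1 \overline{g_1}(x) \, dx \int_T f_2 \overline{g_2}(x) \, dx \right)^q. $$

Now, noticing that $f_2 = \overline{f_1}$, we get

$$ \frac{1}{2\pi} \int_{[0,2\pi]} \langle R_\theta(f), g \rangle^d \, d\theta = C_{2q}^q \left| \int_T f_2(x) \overline{g_2(x)} \, dx \right|^{2q}. $$

The proof is ended since $\langle \tau_u \circ R_\theta(f), \tau_v \circ R_{\theta'}(g) \rangle = \langle \tau_{u-v} \circ R_{\theta-\theta'}(f), g \rangle$.

Now, for $f : T \to \mathbb{R}^2$, we get that $f_2 = j(f)$ where j is the canonical injection
of $\mathbb{R}^2$ in $\mathbb{C}$ given by $j((x, y)) = x + iy$ for any $x, y \in \mathbb{R}$. Hence, one gets the
following theorem.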

Theorem 2. Let f and g be two measurable functions in $L^2(T, \mathbb{R}^2)$. Let d = 2q
be a positive even integer and assume that $\varphi(x) = x^d$. Then we have

$$ k(f, g) = \int_{[0,1]} C_{2q}^q \left| \int_T f(x + u) \overline{g(x)} \, dx \right|^{2q} du, $$

where we identify f and g with j(f) and j(g).

Abstract. Bayesian methods have been avoided in 3D ultrasound. The
multiplicative type of noise which corrupts ultrasound images leads to
slow reconstruction procedures if Bayesian principles are used. Heuristic
approaches have been used instead in practical applications.

This paper tries to overcome this difficulty by proposing an algorithm
which is both fast and derived from sound theoretical principles. This
algorithm is based on the expansion of the noise probability density function
as a Taylor series in the vicinity of the maximum likelihood estimates,
leading to a linear set of equations which are easily solved by standard
techniques. Reconstruction examples with synthetic and medical data
are provided to evaluate the proposed algorithm.



1 Introduction 

This paper addresses the problem of 3D ultrasound. 3D ultrasound aims to
reconstruct the human anatomy from a set of ultrasound images, corresponding
to cross-sections of the human body. Based on this information, the idea is to
estimate a volume of interest for diagnostic purposes. This technique is
widespread, essentially due to its non-invasive and non-ionizing characteristics [1].
Furthermore, ultrasound equipment is less expensive than that of other medical
modalities, such as CT, MRI or PET [2,3]. One way to perform 3D ultrasound
is to use 2D ultrasound equipment with a spatial locator attached to the
ultrasound probe, giving the position and orientation of the cross-section over
time (see Fig. 1). The estimation algorithm should fuse both kinds of information,
image and position, to estimate the volume.

Traditionally, ultrasound imaging is performed in real time using the
B-scan mode. The inspection results are visualized in real time, allowing
the medical doctor to choose the best cross-sections for the diagnosis. In 3D
ultrasound this goal is much more difficult to achieve since the amount of data is
much higher. However, reconstruction time should be kept as small as possible.
This is the reason why many algorithms used in 3D ultrasound are designed
on an ad hoc basis [4,5], aiming to be as simple and fast as possible.

Bayesian approaches in 3D ultrasound have been avoided since these methods
are usually computationally demanding. In this paper we present an algorithm
for 3D ultrasound designed in a Bayesian framework. Its theoretical foundation is
presented, as well as the simplification procedures and their justifications, in
order to speed up the reconstruction process. Our goal is to design an efficient
reconstruction algorithm that works on a quasi real time basis, while keeping a
solid theoretical foundation.

* Corresponding author: Joao Sanches, IST/ISR, Torre Norte, Av. Rovisco Pais,
1049-001 Lisboa, Portugal, Email: jmrs@alfa.ist.utl.pt, Phone: +351 21 8418195

Fig. 1. 3D ultrasound acquisition system

This paper is organized as follows. Section 2 describes the problem of 3D
reconstruction and the notation adopted in this paper. Sections 3 and 4 present
two algorithms for 3D reconstruction: the standard solution and a fast algorithm.
Section 5 presents experimental results with synthetic and real data using both
algorithms. Finally, Section 6 concludes the paper.

2 Problem Formulation 

This section describes the reconstruction of a 3D function $f$ from a set of
ultrasound images. Additional details can be found in [6].

Let us consider a scalar function $f(x)$ defined in $\Omega \subset \mathbb{R}^3$, i.e.,
$f : \Omega \to \mathbb{R}$. We assume that this function is expressed as a linear
combination of known basis functions, i.e.,

$$f(x) = \sum_g b_g(x)\, u_g \qquad (1)$$

where $u_1, u_2, \ldots$ are the unknown coefficients to be estimated and $b_g(x)$ are
known basis functions centered at the nodes of a 3D regular grid. Let $\{y_i\}$ be
a set of intensity data points, measuring $f(x)$ at locations $\{x_i\}$, each belonging
to one of the inspection planes. It is assumed that the intensity measurements are
corrupted by multiplicative noise and the goal is to estimate $f(x)$ based on the
observations $\{y_i\}$.

This estimation problem can be formulated in a Bayesian framework using
a MAP criterion, as follows: given a set of data $Y = \{y_i\}$ with a distribution
$p(Y|U)$ which depends on the unknown parameters $U = \{u_g\}$ with a prior
distribution $p(U)$, estimate $U$ in order to maximize the joint probability density
function of the data and parameters, $p(Y, U)$, i.e.,

$$\hat{U} = \arg\max_U \log\big(p(Y|U)\, p(U)\big) \qquad (2)$$

Fig. 2. Volume and image coordinates

In this paper we assume that all the elements of $Y$ are i.i.d. (independent
and identically distributed) [8] with a Rayleigh distribution, i.e.,

$$\log p(Y|U) = \sum_i \left\{ \log\left(\frac{y_i}{f(x_i)}\right) - \frac{y_i^2}{2 f(x_i)} \right\} \qquad (3)$$

where $f(x_i)$ is the value of the function $f$ to be reconstructed at $x_i$.
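
As an illustration, the log-likelihood (3) can be evaluated directly once the values $f(x_i)$ are available. The sketch below is not taken from the paper: the function name `rayleigh_log_likelihood` and the plain-list inputs are assumptions made for this example, and the formula used is the Rayleigh form $\log(y_i/f(x_i)) - y_i^2/(2 f(x_i))$ that is consistent with the derivative in (7).

```python
import math

def rayleigh_log_likelihood(y, f):
    """Log-likelihood of observations y given reflectivities f, as in Eq. (3).

    y and f are equal-length sequences of positive floats; f[i] plays the
    role of f(x_i), the value of the function at the i-th data location.
    (Hypothetical helper written for illustration.)
    """
    if len(y) != len(f):
        raise ValueError("y and f must have the same length")
    total = 0.0
    for y_i, f_i in zip(y, f):
        # One term of the sum: log(y_i / f(x_i)) - y_i^2 / (2 f(x_i))
        total += math.log(y_i / f_i) - y_i ** 2 / (2.0 * f_i)
    return total
```

For a single observation $y$, this expression is maximal at $f = y^2/2$, which is the single-point case of the ML estimate derived later in the paper.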

The Rayleigh distribution is derived in [9] from the physical assumption that
human tissue consists of a large number of independent scatterers with different
orientations.

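The prior used is Gaussian [10], i.e.,

$$p(U) = \frac{1}{Z} \exp\Big( -\psi \sum_{g} \sum_{i} (u_g - u_{g_i})^2 \Big) \qquad (4)$$

where $u_{g_i}$ is a neighbor of $u_g$ (each pair of neighboring nodes being counted
once), $\psi$ is the regularization parameter and $Z$ is a normalization factor.

Therefore, the objective function can be expressed as

$$L(U) = l(U) + q(U) \qquad (5)$$

where $l = \log p(Y|U)$ is the log-likelihood function of the data and $q = \log p(U)$
is the logarithm of the prior associated with the unknown parameters.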

To optimize (5) the ICM algorithm proposed by Besag [7] is used. The ICM
algorithm simplifies the optimization process by optimizing the objective function
with respect to a single variable at a time, keeping the other variables constant.
Each step is a 1D optimization problem which can be solved in a number of
ways. This step is repeated for all the unknown coefficients in each iteration of
the ICM algorithm.

Fig. 3. Neighboring nodes of a data point

To optimize (5) with respect to a single coefficient $u_p$, the stationary equation

$$\frac{\partial l(U)}{\partial u_p} + \frac{\partial q(U)}{\partial u_p} = 0 \qquad (6)$$

is numerically solved.

The next sections present two approaches to solve (6). Section 3 attempts
to solve this equation using nonlinear optimization methods. Section 4 presents a
fast algorithm based on the solution of a linear set of equations. The second
approach makes some simplifications in order to speed up the computations. Both
methods are iterative.



3 Nonlinear Method 



Let us first compute the derivatives of $l$ and $q$.

After straightforward manipulation it can be concluded that

$$\frac{\partial l(U)}{\partial u_p} = \frac{1}{2} \sum_i \frac{y_i^2 - 2 f(x_i)}{f(x_i)^2}\, b_p(x_i) \qquad (7)$$

where the sum is performed over all data points that are in the neighborhood
$[-\Delta, \Delta]^3$ of the $p$-th node. In fact, each data point contributes to the
estimation of its 8 neighboring coefficients (see Fig. 3).

It is also easy to conclude that

$$\frac{\partial q(U)}{\partial u_p} = -2 \psi N_v (u_p - \bar{u}_p) \qquad (8)$$

where $N_v$ is the number of neighbors of $u_p$ ($N_v = 6$) and $\bar{u}_p$ is the average
intensity computed using the $N_v$ neighbors.

To optimize the objective function a set of nonlinear equations should be
solved,

$$\frac{1}{2} \sum_i \frac{y_i^2 - 2 f(x_i)}{f(x_i)^2}\, b_p(x_i) - 2 \psi N_v (u_p - \bar{u}_p) = 0 \qquad (9)$$

This is a huge optimization problem, which must be solved using numerical
methods. The ICM algorithm proposed by Besag [7] is used and each equation
is numerically solved by the Newton-Raphson method, assuming that the
other coefficients, $u_k$, $k \neq p$, are known. Computing the solution of (9)
is computationally heavy and presents some undesirable difficulties.
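
To make the cost of this step concrete, here is a minimal sketch of one Newton-Raphson solve of (9) for a single coefficient. It is an illustration under stated assumptions, not the authors' implementation: the function name `newton_update`, the argument layout, and the decomposition $f(x_i) = r_i + b_p(x_i) u_p$ (with $r_i$ the fixed contribution of all other coefficients) are introduced here only for the example.

```python
def newton_update(u_p, y, b, rest, psi, n_v, u_bar, tol=1e-10, max_iter=100):
    """Solve Eq. (9) for one coefficient u_p with Newton-Raphson.

    y[i]    : observation at data point i near node p
    b[i]    : basis weight b_p(x_i) of node p at that point
    rest[i] : contribution of all other coefficients to f(x_i) (held fixed)
    u_bar   : average of the N_v neighboring coefficients (held fixed)
    (Hypothetical helper; names and layout are assumptions.)
    """
    for _ in range(max_iter):
        g = -2.0 * psi * n_v * (u_p - u_bar)   # prior term of Eq. (9)
        dg = -2.0 * psi * n_v                  # its derivative w.r.t. u_p
        for y_i, b_i, r_i in zip(y, b, rest):
            f_i = r_i + b_i * u_p              # f(x_i) as a function of u_p
            g += 0.5 * (y_i ** 2 - 2.0 * f_i) / f_i ** 2 * b_i
            dg += (f_i - y_i ** 2) / f_i ** 3 * b_i ** 2
        step = g / dg
        u_p -= step
        if abs(step) < tol:
            break
    return u_p
```

Note that every Newton step loops over all data points near the node, which is why the nonlinear method has to revisit the raw data in each iteration.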

First, it would be convenient to factorize the equation into terms in which the
dependence on the data is separated from the dependence on the unknowns to
estimate,

$$h(u_p)\, g_1(Y)\, r_1(U \setminus \{u_p\}) + g_2(Y)\, r_2(U \setminus \{u_p\}) + C = 0 \qquad (10)$$

where $g_1(Y)$ and $g_2(Y)$ are sufficient statistics. This formulation would
concentrate the influence of the observed data in a small set of coefficients,
computed once and for all at the first iteration and kept unchanged during the
optimization process. Data processing would be done only once, speeding up the
estimation process.

Unfortunately, it is not possible to write (9) in the form of (10), i.e., there
are no sufficient statistics for the estimation of the interpolating function $f$.
This means that all the data must be read from the disk and processed in each
iteration of the nonlinear reconstruction algorithm. This is a strong limitation
when a large number of cross-sections is involved, e.g., 1000 images with 640 x
480 pixels will lead to $3072 \times 10^5$ pixels, preventing a widespread use of this
algorithm.

Another important difficulty concerns the stability of the convergence
process. The system of equations (9) is nonlinear. The stability of the numerical
methods used to solve it strongly depends on the data, on the regularization
parameter $\psi$, and on the initial estimates of $U$. The process of finding the right
parameters to obtain acceptable reconstructions is in general done by trial
and error.

To overcome these difficulties an approximation approach is proposed in the 
next section. 



4 Linear Solution 



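Let us expand $l(U)$ in a Taylor series about the maximum likelihood estimates,
$U^{ML}$,

$$l(U) = l(U^{ML}) + \sum_p \left[ \frac{\partial l(U^{ML})}{\partial u_p} (u_p - u_p^{ML}) + \frac{1}{2} \frac{\partial^2 l(U^{ML})}{\partial u_p^2} (u_p - u_p^{ML})^2 \right] + \varepsilon \qquad (11)$$

Joao M. Sanches and Jorge S. Marques

The first derivative of $l(U)$ with respect to $u_p$ is

$$\frac{\partial l(U)}{\partial u_p} = \frac{\partial^2 l(U^{ML})}{\partial u_p^2} (u_p - u_p^{ML}) \qquad (12)$$

where it was assumed that $\partial l(U^{ML}) / \partial u_p = 0$ since by definition $U^{ML}$
is a stationary point of $l(U)$. The residue $\varepsilon$ was discarded for convenience.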

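Thus (6) takes the form

$$\frac{\partial L(U)}{\partial u_p} = \frac{\partial^2 l(U^{ML})}{\partial u_p^2} (u_p - u_p^{ML}) - 2 \psi N_v (u_p - \bar{u}_p) = 0 \qquad (13)$$

leading to

$$u_p = \frac{1}{1 + \tau_p}\, u_p^{ML} + \frac{\tau_p}{1 + \tau_p}\, \bar{u}_p \qquad (14)$$

where $\tau_p = -2 \psi N_v \Big/ \dfrac{\partial^2 l(U^{ML})}{\partial u_p^2}$.

Equation (14) shows that the MAP estimate can be seen as a linear combination
of the ML estimate with the average intensity computed in the neighborhood
of each node.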

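Let us compute the maximum likelihood estimate of $U$.

Assuming that $f(x)$ changes slowly in the neighborhood of each node, the
approximation $f(x_i) \approx u_p$ will be used in (7) to obtain

$$\frac{\partial l(U)}{\partial u_p} \approx \frac{1}{2 u_p^2} \sum_i (y_i^2 - 2 u_p)\, b_p(x_i) = 0 \qquad (15)$$

Solving with respect to $u_p$ leads to

$$u_p^{ML} = \frac{1}{2}\, \frac{\sum_i y_i^2\, b_p(x_i)}{\sum_i b_p(x_i)} \qquad (16)$$

and differentiating (15) with respect to $u_p$ leads to

$$\frac{\partial^2 l(U^{ML})}{\partial u_p^2} = -\frac{\sum_i b_p(x_i)}{(u_p^{ML})^2} \qquad (17)$$

This expression for the second derivative of the log-likelihood function,
obtained by differentiating (15) with respect to $u_p$, can be computed more
accurately if (7) is used. By differentiating (7) with respect to $u_p$ and then
replacing $f(x_i)$ by $u_p$, one obtains

$$\frac{\partial^2 l(U^{ML})}{\partial u_p^2} = -\frac{\sum_i b_p^2(x_i)}{(u_p^{ML})^2} \qquad (18)$$

We have used expression (17) in the reconstruction using synthetic data and (18)
in the case of real data.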
Fig. 4. Cross sections extracted from a synthetic 3D cube

Therefore, the MAP estimate of the volume of interest is obtained by solving
a system of linear equations given by (14) where, using (17),

$$\tau_p = \frac{2 \psi N_v (u_p^{ML})^2}{\sum_i b_p(x_i)}$$

and where $u_p^{ML}$ is given by (16).

For the sake of simplicity, (14) can be rewritten as

$$u_p = k_p + c_p\, \bar{u}_p \qquad (19)$$

where $k_p = u_p^{ML} / (1 + \tau_p)$ and $c_p = \tau_p / (1 + \tau_p)$. These parameters,
$k_p$ and $c_p$, are computed once and for all during the initialization phase. The
solution of (14) can be obtained by standard algorithms for the solution of linear
sets of equations.
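
The following sketch illustrates how (16), (17) and (19) fit together. It is a simplified illustration, not the implementation used in the paper: it assumes a 1D chain of nodes instead of a 3D grid (so each node has at most two neighbors rather than six), it assumes every node has data, and the function name `linear_map_1d` and the Gauss-Seidel sweep are choices made for this example.

```python
def linear_map_1d(sum_y2_b, sum_b, psi, n_iter=200):
    """Linear MAP reconstruction on a 1D chain of nodes (illustrative only).

    sum_y2_b[p] = sum_i y_i^2 b_p(x_i) and sum_b[p] = sum_i b_p(x_i) are the
    per-node data summaries; every node is assumed to have sum_b[p] > 0.
    """
    n = len(sum_b)
    # Eq. (16): maximum likelihood estimate at each node.
    u_ml = [0.5 * s / b for s, b in zip(sum_y2_b, sum_b)]
    k, c = [], []
    for p in range(n):
        n_v = (1 if p > 0 else 0) + (1 if p < n - 1 else 0)  # neighbors in 1D
        tau = 2.0 * psi * n_v * u_ml[p] ** 2 / sum_b[p]       # from Eq. (17)
        k.append(u_ml[p] / (1.0 + tau))                       # k_p of Eq. (19)
        c.append(tau / (1.0 + tau))                           # c_p of Eq. (19)
    u = list(u_ml)  # start from the ML estimate
    for _ in range(n_iter):
        for p in range(n):  # Gauss-Seidel sweep over u_p = k_p + c_p * u_bar_p
            nb = [u[q] for q in (p - 1, p + 1) if 0 <= q < n]
            u_bar = sum(nb) / len(nb) if nb else 0.0
            u[p] = k[p] + c[p] * u_bar
    return u
```

The raw observations enter only through the two per-node sums, which are accumulated in a single pass over the data; the iterations afterwards touch only the coefficients. With `psi = 0` the result reduces to the ML estimate.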



5 Experimental Results 

This section presents two 3D reconstruction examples using synthetic and real 
data. 

The synthetic data consists of a set of 100 images of 128 x 128 pixels
corresponding to parallel cross sections of the 3D interval $[-1,1]^3$ (see Fig. 4).
The function to be reconstructed is assumed to be binary: $f(x) = 5000$ for
$x \in [-0.5, 0.5]^3$ and $f(x) = 2500$ otherwise. The cross sections were corrupted
with Rayleigh noise according to (3). The histogram of the whole set of images is
shown in Fig. 5 and is a mixture of two Rayleigh densities. Both reconstruction
algorithms were used to reconstruct $f$ in the interval $[-1,1]^3$ using a
regularization parameter $\psi$ = 16.10^®.

Fig. 6 shows the profiles extracted from the estimated volumes using both
methods. These two profiles are quite similar, which means that both methods
lead to similar results in this problem. The SNR of 21.4 dB for the nonlinear
method and 20.8 dB for the fast algorithm proposed in this paper stresses the
ability of the linear algorithm to produce results similar to those obtained with
the nonlinear method.

Fig. 5. Synthetic data set histogram

Fig. 6. Profiles extracted from the original volume and from the estimated volumes
using the nonlinear and linear methods

It should be stressed that the linear method is computationally lighter.
Fig. 7 and Fig. 8 show the evolution of the posterior distribution function,
$\log p(Y, U)$, along the iterative process. Fig. 7 displays $\log p(Y, U)$ as a function
of the iteration index while Fig. 8 displays the same values as a function of
time.

The nonlinear algorithm converges in fewer iterations (62) than the linear
algorithm (97) in this example. However, since each iteration of the nonlinear
method is slower, as it involves processing all (millions) of the observations, its
convergence is slower in terms of computation time (see Fig. 8): in this case it is
about 6 times slower than the linear method^1.

^1 These values depend on the number and dimensions of the images and on the desired
accuracy of the solution. Very accurate solutions need more iterations, and
the linear method then becomes more efficient.
Fig. 7. L(U) along the iterative process

Fig. 8. L(U) along the iterative process as a function of time



Fig. 9 shows cross sections of the 3D volume (left) as well as the 3D surface 
of the cube displayed using rendering methods (right). The results are again 
similar, the nonlinear method performing slightly better at the transitions. 

The real data is formed by a set of 100 images of a human thyroid with
128 x 128 pixels. Fig. 10 shows the corresponding histogram. This histogram
reveals some significant differences from that of the synthetic data. In the
case of the synthetic data the underlying 3D object is binary, while in the case
of the real data a continuous range of reflectivity values is admissible.

Profiles extracted from both estimated volumes are shown in Fig. 11, together
with images belonging to the initial data set. The profiles were
computed from images extracted from the estimated volumes with dimensions
and positions equivalent to the cross-sections shown in the figure. The graph
also shows a profile extracted from the maximum likelihood estimates computed
using expression (16). Here, the difference between the two methods is
more visible, which is related to the deviation of the real data from the true
Rayleigh model. However, we conclude once more that the linear method leads
to acceptable results, similar to the ones obtained with the nonlinear algorithm.

Fig. 9. Reconstructed volumes, a) using the nonlinear method and b) using the linear
method

Fig. 10. Real data histogram

6 Conclusion 

This paper presents an algorithm to estimate the acoustic reflectivity in a given
region of interest from a set of ultrasound images. The images are complemented
with the position and orientation of the ultrasound probe. The proposed
algorithm is formulated in a Bayesian framework using a MAP criterion. To speed
up the reconstruction, a simplified (linear) algorithm was proposed based on
the concept of sufficient statistics.

Fig. 11. Profiles extracted from the estimated volumes using the nonlinear and linear
methods (legend: ML, non-linear, linear)

The goal is to obtain a fast and efficient MAP algorithm to estimate volumes
on a quasi real time basis. Reconstruction results obtained with both methods
are presented, one using a set of images extracted from a synthetic 3D cube and
the other using a set of real cross-sections of a human thyroid. Both examples
show that the fast (linear) algorithm performs almost as well as the nonlinear
version. Profiles extracted from the estimated volumes are quite similar and the
signal to noise ratio (for the synthetic case only) computed with the original
volume reinforces this similarity. It is concluded that the linear algorithm needs
more iterations to reconstruct the volume than the nonlinear one but it spends
much less time. This is explained by the fact that the linear method has
to process the huge amount of data only once, while the nonlinear method must
read and process the data in each iteration. A final note should be provided. The
formulation of the linear method is simpler than that of the nonlinear method.
The estimation process in the first case is obtained by solving a set of linear
equations, while in the non-simplified case a set of nonlinear equations must
be solved. The nonlinear method presents problems of convergence and stability,
not addressed in this paper, which are also avoided by using the linear
reconstruction method proposed here.
Abstract. Hidden Markov Models (HMMs) are a useful and widely
utilized approach to the modeling of data sequences. One of the problems
related to this technique is finding the optimal structure of the
model, namely, its number of states. Although a lot of work has been
carried out in the context of model selection, few works address this
specific problem, and heuristic rules are often used to define the model
depending on the tackled application. In this paper, instead, we use the
notion of probabilistic bisimulation to automatically and efficiently
determine the minimal structure of an HMM. Bisimulation allows HMM states
to be merged in order to obtain a minimal set that does not significantly
affect model performance. The approach has been tested on DNA sequence
modeling and 2D shape classification. Results are presented as a
function of reduction rates, classification performance, and noise
sensitivity.



1 Introduction 

Hidden Markov Models (HMMs) represent a widespread approach to the
modeling of sequences: they attempt to capture the underlying structure of a set
of symbol strings. HMMs can be viewed as stochastic generalizations of
finite-state automata, where both transitions between states and generation of
output symbols are governed by probability distributions [1].

The basic theory of HMMs was developed by Baum et al. [2,3] in the late
1960s, but only in the last decade has it been extensively applied to a large
number of problems. A non-exhaustive list of such problems consists of speech
recognition [1], handwritten character recognition [4], DNA and protein
modelling [5], gesture recognition [6] and, more generally, behavior analysis and
synthesis [7].

HMMs fit very well in a large number of situations, in particular where the
state sequence structure of the process examined can be assumed to be
Markovian. Unfortunately, there are some drawbacks [8]. First, the iterative
technique for HMM learning (Baum-Welch re-estimation) converges to a local
optimum, not necessarily the global one, and the choice of appropriate initial
parameter estimates is crucial for convergence. Second, a large amount of training
data is generally necessary to estimate HMM parameters. Finally, the HMM
topology and number of states have to be determined prior to learning, and usually
heuristic rules are used for this purpose (e.g., [9]). This paper proposes a novel
approach for resolving this final problem, in particular to determine the number
of states. This issue could be tackled by using traditional methods of model
selection; numerous paradigms have been proposed in this context, and a
non-exhaustive list includes [10]: Minimum Description Length (MDL), Bayesian
Information Criterion (BIC), Minimum Message Length (MML), Mixture Minimal
Description Length (MMDL), Evidence Based Bayesian (EBB), etc. More
computationally intensive approaches are stochastic approaches (e.g., Markov Chain
Monte Carlo (MCMC)), re-sampling based schemes, and cross-validation methods.
Although principally derived for fitting mixture models, many of these techniques
could also be applied in the HMM context, as proposed in [11] and [12]. It is worth
noting that these approaches are devoted to finding the optimal model on the basis
of a criterion function by exploring all (or a large part of) the search space. Our
work proposes instead a direct method to identify the model without searching
the whole space, which makes it less computationally intensive. In [11], starting
with a redundant configuration, an optimal structure is obtained by repeated
Bayesian merging of states in an incremental way, as new evidence arrives.
In [12], a method for simultaneous learning of HMM structure and parameters
is proposed. Parameter uncertainty is minimized by introducing an entropic
prior and Maximum a Posteriori Probability (MAP) estimation. In this way,
redundant parameters are eliminated and the model becomes sparse; moreover,
posterior probability increases, and the resulting architecture is easier to
interpret.

Our approach consists in eliminating the syntactic redundancy of a Hidden
Markov Model using a technique called bisimulation. Bisimulation is a notion
of equivalence between graphs whose usefulness has been demonstrated in
various fields of Computer Science. In Concurrency it is used for testing process
equivalence [18], in Model-Checking as a notion of equivalence between Kripke
Structures [20], in Web-like databases for providing operational semantics to
query languages [17], and in Set Theory for replacing extensionality in the context
of non-well-founded sets [13].

With our approach, the structure of an HMM is reduced by computing the
bisimulation equivalence relation between states of the model, so that equivalent
states can be collapsed. We employ both the notions of probabilistic and standard
bisimulation. We will show that bisimulation reduces the number of states
without significant loss in terms of likelihood and classification accuracy. We test
this approach by reporting experiments on DNA sequence modelling and 2D shape
recognition using chain code. We will show that the proposed procedure is fully
automatic, efficient, and provides promising results. We also compare our
approach with the BIC (Bayesian Information Criterion) method, which is
equivalent to MDL [10], showing that this technique is nearly as acceptable as ours,
as far as classification accuracy is concerned, but is more computationally
demanding.
The rest of the paper is organized as follows: Sect. 2 contains a formal
description of HMMs. In Sect. 3, the notion of bisimulation and the algorithm to
compute equivalence classes are described. In Sect. 4 we detail our strategy and in
Sect. 5 experiments and results are presented. Finally, Sect. 6 contains conclusions
and future perspectives.

2 Hidden Markov Models 

An HMM is formally defined by the following elements (see [1] for further
details):

- A set $S = \{S_1, S_2, \ldots, S_N\}$ of (hidden) states.

- A state transition probability distribution, also called transition matrix
$A = \{a_{ij}\}$, representing the probability to go from state $S_i$ to state $S_j$:

$$a_{ij} = P[q_{t+1} = S_j \mid q_t = S_i], \quad 1 \le i, j \le N \qquad (1)$$

with $a_{ij} \ge 0$ and $\sum_{j=1}^{N} a_{ij} = 1$.

- A set $V = \{v_1, v_2, \ldots, v_M\}$ of observation symbols.

- An observation symbol probability distribution, also called emission matrix
$B = \{b_j(k)\}$, indicating the probability of emission of symbol $v_k$ when the
system state is $S_j$:

$$b_j(k) = P[v_k \text{ at time } t \mid q_t = S_j], \quad 1 \le j \le N,\ 1 \le k \le M \qquad (2)$$

with $b_j(k) \ge 0$ and $\sum_{k=1}^{M} b_j(k) = 1$.

- An initial state probability distribution $\pi = \{\pi_i\}$, representing the
probabilities of initial states:

$$\pi_i = P[q_1 = S_i], \quad 1 \le i \le N \qquad (3)$$

with $\pi_i \ge 0$ and $\sum_{i=1}^{N} \pi_i = 1$.

For convenience, we denote an HMM as a triplet $\lambda = (A, B, \pi)$, which
uniquely determines the model.
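
The constraints in (1)-(3) can be checked mechanically. The sketch below is only an illustration: the function name `is_valid_hmm`, the list-of-lists representation and the tolerance `tol` are assumptions introduced for this example.

```python
def is_valid_hmm(A, B, pi, tol=1e-9):
    """Check the constraints of Eqs. (1)-(3) for an HMM lambda = (A, B, pi).

    A  : N x N list of lists, A[i][j] = a_ij
    B  : N x M list of lists, B[j][k] = b_j(k)
    pi : list of N initial state probabilities
    (Hypothetical helper written for illustration.)
    """
    n = len(pi)
    if len(A) != n or len(B) != n:
        return False
    if any(len(row) != n for row in A):
        return False
    if len({len(row) for row in B}) != 1:  # every state emits over the same M symbols
        return False

    def is_distribution(row):
        # Non-negative entries that sum to one (up to rounding error).
        return all(x >= 0.0 for x in row) and abs(sum(row) - 1.0) <= tol

    return (is_distribution(pi)
            and all(is_distribution(row) for row in A)
            and all(is_distribution(row) for row in B))
```

For instance, a two-state model with rows of `A`, rows of `B` and `pi` each summing to one passes the check, while a transition row summing to 1.1 does not.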

3 Bisimulation 

Bisimulation is a notion of equivalence between graphs useful in several fields of
Computer Science. The notion was introduced by Park for testing process
equivalence, extending a previous notion of automata simulation by Milner. Milner
then employed bisimulation as the core for establishing observational equivalence
of the Calculus of Communicating Systems [18].

Kanellakis and Smolka in [16] related the bisimulation problem to the
general (relational) coarsest partition problem and pointed out that the partition
refinement algorithm in [19] solves this task. More precisely, in [19] Paige and
Tarjan solve the problem in which the stability requirement is relative to a
relation $E$ (edges) on a set $N$ (nodes) with an algorithm whose complexity is
$O(|E| \log |N|)$.
Standard Bisimulation. Bisimulation can be equivalently formulated as a relation
between two graphs and as a relation between nodes of a single graph. We adopt
the latter definition since we are interested in reducing states of a unique graph.

Definition 1. Given a graph $G = \langle N, E \rangle$, a bisimulation on $G$ is a relation
$b \subseteq N \times N$ s.t. for all $u_0, u_1 \in N$ s.t. $u_0\, b\, u_1$ and for $i = 0,1$:
if $\langle u_i, v_i \rangle \in E$, then there exists $\langle u_{1-i}, v_{1-i} \rangle \in E$ s.t.
$v_0\, b\, v_1$.

In order to minimize the number of nodes of a graph, we look for the maximal
bisimulation $\equiv$ on $G$. Such a maximal bisimulation always exists, it is unique,
and it is an equivalence relation over the set of nodes of $G$ [13]. The minimal
representation of $G = \langle N, E \rangle$ is therefore the graph:

$$\langle N/{\equiv},\ \{ \langle [m]_{\equiv}, [n]_{\equiv} \rangle : \langle m, n \rangle \in E \} \rangle$$

which is usually called the bisimulation contraction of $G$. Using the algorithm
in [19] the problem can be solved in time $O(|E| \log |N|)$; for acyclic graphs and for
some classes of cyclic graphs it can be solved in linear time w.r.t. $|N| + |E|$ [15].
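
As an illustration of the bisimulation contraction, the sketch below computes the classes of the maximal bisimulation by naive partition refinement. This is deliberately not the Paige-Tarjan algorithm of [19]: it is a simpler and slower scheme chosen for readability, and the function names `bisimulation_classes` and `contraction` are assumptions made for this example.

```python
def bisimulation_classes(nodes, edges):
    """Return a dict node -> class id for the maximal bisimulation on (N, E).

    nodes : iterable of hashable node names
    edges : iterable of (source, target) pairs
    (Naive refinement, written for illustration only.)
    """
    nodes = list(nodes)
    succ = {n: set() for n in nodes}
    for u, v in edges:
        succ[u].add(v)
    block = {n: 0 for n in nodes}  # start with all nodes in one class
    n_blocks = 1
    while True:
        ids = {}
        new_block = {}
        for n in nodes:
            # Two nodes stay together only if they were together before and
            # reach exactly the same set of classes in one step.
            key = (block[n], frozenset(block[v] for v in succ[n]))
            new_block[n] = ids.setdefault(key, len(ids))
        block = new_block
        if len(ids) == n_blocks:  # no class was split: the partition is stable
            return block
        n_blocks = len(ids)


def contraction(nodes, edges):
    """Quotient graph: one node per class, one edge per pair of linked classes."""
    block = bisimulation_classes(nodes, edges)
    return set(block.values()), {(block[u], block[v]) for u, v in edges}
```

On a graph with edges a→c, b→c and d→a, the nodes a and b fall in the same class, so the contraction has three nodes and two edges.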

Bisimulation on labeled graphs. If the graphs are such that nodes and/or edges
are labeled, the notion can be reformulated as follows:

Definition 2. Let $G = \langle N, E, \ell \rangle$ be a graph with a labeling function $\ell$ for
nodes, and labeled edges of the form $m \xrightarrow{a} n$ ($a$ belongs to a set of labels).
A bisimulation on $G$ is a relation $b \subseteq N \times N$ s.t. for all $u_0, u_1 \in N$ s.t.
$u_0\, b\, u_1$ it holds that: $\ell(u_0) = \ell(u_1)$ and for $i = 0,1$, if
$u_i \xrightarrow{a} v_i \in E$, then there exists $u_{1-i} \xrightarrow{a} v_{1-i} \in E$
s.t. $v_0\, b\, v_1$.

If only the nodes are labeled, the procedure in [19] can be employed to find the
bisimulation contraction, provided that in the initialization phase nodes with
the same labels are put in the same class. The case in which edges are labeled
can be reduced to the previous one by replacing a labeled edge $m \xrightarrow{a} n$ by a
new node $\nu$ labeled by $a$ and by the edges $\langle m, \nu \rangle$ and $\langle \nu, n \rangle$.
Therefore, finding the bisimulation contraction also in this case can be done using
the algorithm of [19]; moreover, the procedure of [19] can be modified in order to
deal directly (i.e., without preprocessing) with the general case described.

Probabilistic Bisimulation. The notion of bisimulation over labeled graphs (Def. 2)
has been introduced in a context where labels denote actions executed (e.g. a
symbol is emitted) by processes during their run. Labels can also store pairs of
values $(x, y)$: an action $x$ and a probability value $y$ (that could be read as: this
edge can be crossed with probability $y$ and in this case an action $x$ is done).
In this case another notion of bisimulation is perhaps more suitable. Consider,
for instance, the graph of Fig. 1 (the names $n_i$ refer to the nodes: they are
not labels). The two nodes with no outgoing edges are trivially equivalent:
nothing can be done in either case. Four other nodes are in the
same equivalence class, since they have equivalent successors (reachable
performing the same action $b$, with probability 1). The nodes $n_1$ and $n_4$ are
instead not equivalent, since, for instance, an edge labeled $(a, 0.4)$ starts from
$n_1$ but no edge labeled $(a, 0.4)$ starts from $n_4$. However, from both $n_1$ and $n_4$
one of the equivalent states can be reached by performing action $a$ with
probability 0.7: the two nodes should be considered equivalent. These graphs are
called Fully Probabilistic Labelled Transition Systems (FPLTS).

Fig. 1. $n_1$ and $n_4$ are not bisimilar, but probabilistically bisimilar.

The notion of probabilistic bisimulation [14] is aimed at formally justifying
this intuitive concept. We start by providing two auxiliary notions. Given a graph
$G = \langle N, E \rangle$ with edges labeled by pairs as above, and a relation
$b \subseteq N \times N$, then for two nodes $m, n \in N$ and a symbol $a$, we define the
functions $B$ and $S$ as follows:

$$B(m, n, a) = \{ \mu : \exists q\, (m \xrightarrow{(a,q)} \mu \in E \wedge \mu\, b\, n) \} \quad \text{and} \quad S(m, n, a) = \sum_{m \xrightarrow{(a,q)} \mu \in E,\ \mu\, b\, n} q$$

Definition 3. Let $G = \langle N, E \rangle$ be a graph with edges labeled by pairs consisting
of symbols and probability values. A probabilistic bisimulation on $G$ is a relation
$b \subseteq N \times N$ s.t.: for all $u_0, u_1 \in N$, if $u_0\, b\, u_1$ then for $i = 0,1$, if
$u_i \xrightarrow{(a,p)} v_i \in E$, then there exists $v_{1-i} \in N$ s.t.:

- $u_{1-i} \xrightarrow{(a,p')} v_{1-i} \in E$,

- $S(u_i, v_i, a) = S(u_{1-i}, v_{1-i}, a)$, and

- for all $m \in B(u_i, v_i, a)$ and $n \in B(u_{1-i}, v_{1-i}, a)$ it holds that $m\, b\, n$.

In [14] a modification of the Paige-Tarjan procedure is presented for this
case and proved to correctly return the probabilistic contraction of a graph
$G = \langle N, E \rangle$ in time $O(|N||E| \log |N|)$. In the example of Fig. 1 the two nodes
$n_1$ and $n_4$ are put in the same class.
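
The idea behind $S$ can be sketched with a naive refinement in which two nodes stay in the same class only if, for every action, they send the same total probability into every class. This is a simplified illustration, not the modified Paige-Tarjan procedure of [14]; the function name, the tuple layout of the edges and the rounding parameter `digits` are assumptions made for this example.

```python
from collections import defaultdict

def probabilistic_bisimulation_classes(nodes, edges, digits=9):
    """Return a dict node -> class id under probabilistic bisimulation.

    nodes : iterable of hashable node names
    edges : iterable of (source, action, probability, target) tuples
    digits: probabilities are compared after rounding their sums to this
            many decimals (an assumption, to avoid floating-point noise)
    """
    nodes = list(nodes)
    out = defaultdict(list)
    for u, a, p, v in edges:
        out[u].append((a, p, v))
    block = {n: 0 for n in nodes}
    n_blocks = 1
    while True:
        ids, new_block = {}, {}
        for n in nodes:
            # Total probability of reaching each class with each action.
            mass = defaultdict(float)
            for a, p, v in out[n]:
                mass[(a, block[v])] += p
            sig = frozenset((a, c, round(m, digits)) for (a, c), m in mass.items())
            new_block[n] = ids.setdefault((block[n], sig), len(ids))
        block = new_block
        if len(ids) == n_blocks:  # no class was split: the partition is stable
            return block
        n_blocks = len(ids)
```

On a hypothetical graph loosely inspired by Fig. 1, where one node reaches two equivalent successors with $(a, 0.4)$ and $(a, 0.3)$ and another reaches an equivalent successor with $(a, 0.7)$, the two source nodes end up in the same class.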

In this paper we will further extend the possible labels for edges. We admit
triplets $(p_1, a, p_2)$ where $a$ is a symbol while $p_1$ and $p_2$ are probability values.
We extend the notion of Definition 3 above point by point. In other words,
we reason as if the edge $(p_1, a, p_2)$ were replaced by the two edges $(a, p_1)$,
$(\bar{a}, p_2)$, where $\bar{a}$ cannot be confused with $a$ (see Fig. 2).

4 The Strategy 

HMM as labeled graphs. Probabilistic bisimulation is defined on FPLTS, which
are slightly different from HMMs. Neglecting notation, the real problem is
represented by the emission probability of each state, which has no counterpart in
Fig. 2. $n_1$ and $n_4$ are probabilistically bisimilar, $n_1$ and $n_5$ are not.



FPLTS. As described in Sect. 3, we can solve the problem by choosing an
appropriate initial partition, whose sets contain states with the same emission
probability, and then running the algorithm of [19]. This approach is correct, but
it is too restrictive with respect to the concept of probabilistic bisimulation. In
other words, using this initialization we create classes of bisimulation equivalence
using the concept of syntactic labelling, losing instead the semantic labelling
concept, which is the kernel of probabilistic bisimulation.

Thus, we propose another method, a bit more expensive in terms of memory
allocation and computational cost, but offering a better semantic
characterization.

Definition 4. Given an HMM $\lambda = (A, B, \pi)$, trained with a set of strings from
an alphabet $V = \{v_1, v_2, \ldots, v_M\}$, the equivalent FPLTS is obtained as follows.
For each state $S_i$:

- Let $A_i$ be the set of edges outgoing from the state $S_i$, defined as

$$A_i = \{ (S_i, S_j) : a_{ij} \neq 0,\ 1 \le j \le N \}$$

- each edge $e$ in $A_i$ is replaced by $M$ edges, whose labels are
$\langle a_{ij}, v_k, B_i(k) \rangle$, where, for $1 \le i, j \le N$, $1 \le k \le M$:

  • $a_{ij}$ is the probability of $e$;

  • $v_k$ is the $k$-th symbol of $V$;

  • $B_i(k)$ is the probability of emission of $v_k$ from state $S_i$.

Fig. 3. Basic idea of procedure to represent HMM as a FPLTS.





Given an HMM with $N$ states, $K$ edges and $M$ symbols, with this approach
the complexity of bisimulation contraction grows from $O(KN \log N)$ to
$O(MKN \log N)$ for time, and from $O(KN)$ to $O(MKN)$ for space.
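
The construction of Definition 4 can be sketched as follows. This is an illustrative reading of the definition, not the authors' code: the function name `hmm_to_fplts`, the zero-based state indices and the tuple layout of the returned edges are assumptions made for this example.

```python
def hmm_to_fplts(A, B, symbols):
    """Build the labeled edges of Definition 4 from an HMM.

    A       : N x N transition matrix, A[i][j] = a_ij
    B       : N x M emission matrix, B[i][k] = B_i(k)
    symbols : list of the M observation symbols v_1, ..., v_M
    Returns a list of (i, j, (a_ij, v_k, B_i(k))) tuples, states indexed from 0.
    (Hypothetical helper written for illustration.)
    """
    edges = []
    for i, row in enumerate(A):
        for j, a_ij in enumerate(row):
            if a_ij == 0:
                continue  # only edges with non-zero transition probability
            for k, v_k in enumerate(symbols):
                edges.append((i, j, (a_ij, v_k, B[i][k])))
    return edges
```

Each non-zero transition yields exactly $M$ labeled edges, which is where the factor $M$ in the time and space complexity above comes from.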

Applying bisimulation to an HMM raises another important issue: how to
control, at least partially, the compression rate of our strategy. To this end,
we introduce the concept of quantization of probability: given a set of
quantization levels (prototypes) in the interval [0,1], we approximate each
probability with the closest prototype. A uniform quantization is adopted on
the interval [0,1]. To control this approximation we define a reduction factor,
representing the number of levels that subdivide the interval: it is calculated
as (number of prototypes - 2). For example, reduction factor 3 means that
probabilities are approximated with the values {0, 0.25, 0.5, 0.75, 1}. Thus,
two labels are considered equivalent when their quantized values are equal,
where quant(p) is defined as the prototype closest to p.
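The quantization rule can be illustrated with a short Python sketch. The
function names and the tie-breaking rule (a probability exactly halfway between
two prototypes goes to the lower one) are assumptions of this example; the text
does not specify how ties are handled.

```python
def prototypes(reduction_factor):
    """Uniformly spaced levels on [0, 1]; reduction_factor = len(levels) - 2."""
    n = reduction_factor + 2
    return [k / (n - 1) for k in range(n)]


def quant(p, reduction_factor):
    """Return the prototype closest to p (ties go to the lower prototype)."""
    return min(prototypes(reduction_factor), key=lambda q: abs(q - p))


levels = prototypes(3)          # the five values of the example in the text
same_coarse = quant(0.20, 3) == quant(0.30, 3)   # both map to 0.25
same_fine = quant(0.20, 9) == quant(0.30, 9)     # 0.2 and 0.3 stay distinct
```

With reduction factor 3 the probabilities 0.20 and 0.30 are treated as equal,
while with reduction factor 9 they are kept apart, which is the sense in which
a lower factor favours merging.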

As a final consideration, the reduction factor represents a tuning parameter 
for deciding the degree of compression adopted. Obviously, for a low value of the 
factor, information lost in approximation is high, and the resulting model can 
be a very poor representation of the original one. 

Algorithm. Given a problem, the optimal number of HMM states is determined
through the following steps:

1. Train an HMM with a number of states that is reasonably large with respect
to the problem considered. This number strongly depends on the available
data, and it can be determined using some heuristics.

2. Transform the HMM into a labelled graph (FPLTS), using the procedure
described in Def. 4 of Sect. 4. In this step we have to choose a reduction
factor, which sets the accuracy adopted in the conversion. It also gives a
rough indication of the reduction rate: lower precision is likely to mean
higher compression.

3. Run the bisimulation algorithm on this graph, obtaining the equivalence
classes. The optimal number of states N' is given by the cardinality of the
quotient set (i.e., the number of different classes determined by bisimulation).

4. Retrain the HMM using N' states.

This method is designed for discrete HMMs, but can be generalized to other
types by working on Step 2 of the procedure.
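Step 3 can be pictured with the following Python sketch. It is not the
algorithm of [19]: it is a naive signature-based partition refinement that
treats each quantized triple as an ordinary edge label, and is shown only to
make the notion of "number of equivalence classes" concrete. The function name,
the edge format and the toy graph are assumptions of this example.

```python
from collections import defaultdict


def quotient_size(n_states, edges, quantize):
    """Naive refinement; returns (N', class index of each state)."""
    outgoing = defaultdict(list)
    for src, tgt, (a, v, b) in edges:
        outgoing[src].append((quantize(a), v, quantize(b), tgt))

    block = [0] * n_states                      # start with a single class
    while True:
        ids = {}
        new_block = []
        for s in range(n_states):
            # Signature of a state: its current class plus the set of
            # (quantized label, class of the target) pairs it can reach.
            sig = (block[s], frozenset((a, v, b, block[t])
                                       for a, v, b, t in outgoing[s]))
            new_block.append(ids.setdefault(sig, len(ids)))
        if len(ids) == len(set(block)):         # no class was split: stable
            return len(ids), new_block
        block = new_block


# Hypothetical 3-state graph: states 0 and 1 differ only by a small
# probability gap, state 2 emits a different symbol.
edges = [
    (0, 2, (0.98, "a", 1.0)),
    (1, 2, (1.00, "a", 1.0)),
    (2, 2, (1.00, "b", 1.0)),
]
coarse = lambda p: round(p * 4) / 4     # five uniform levels on [0, 1]
n_prime, classes = quotient_size(3, edges, coarse)
```

Under the coarse quantization the first two states fall into the same class,
so the sketch returns N' = 2; without quantization all three states stay
separate.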

5 Experimental Results 

The aim of the following experiments is to show that this method reduces HMM
states without significant loss in terms of likelihood and classification
accuracy. We tested these two properties on two distinct problems: DNA modeling,
i.e., using HMMs to model and recognize different DNA sequences (typically,
fragments of genes), and 2D shape classification using chain code (modeled by
HMMs). In all tests, each HMM was trained in three learning sessions using
Baum-Welch re-estimation, and the model with the maximum likelihood was chosen.
Each learning session started from random initial estimates of $A$, $B$ and
$\pi$, and ended when the likelihood converged or after 100 training cycles.
Performance is measured in terms of the following indices:

- Compression Rate, a percentage measure of the number of states eliminated by
  bisimulation:

  $$CR = 100\,\frac{N_{orig} - N_{reduct}}{N_{orig}}$$

  where $N_{reduct}$ is the number of states after bisimulation on an HMM with
  $N_{orig}$ states;

- Log Likelihood Loss, estimating the difference in LL between the original and
  the reduced HMM:

  $$LLL = 100\,\frac{LL_{orig} - LL_{reduct}}{LL_{orig}}$$

  where $LL_{reduct}$ and $LL_{orig}$ are the log likelihoods of the HMMs with
  $N_{reduct}$ and $N_{orig}$ states, respectively.
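The two indices can be computed as in the Python sketch below. The function
names and the log-likelihood values are invented for illustration. Dividing by
the absolute value of the original log-likelihood is an assumption made here so
that a drop in log-likelihood gives a positive loss; the text does not spell
out the sign convention.

```python
def compression_rate(n_orig, n_reduct):
    """Percentage of states eliminated by bisimulation."""
    return 100.0 * (n_orig - n_reduct) / n_orig


def log_likelihood_loss(ll_orig, ll_reduct):
    """Relative log-likelihood loss in percent (sign convention assumed)."""
    return 100.0 * (ll_orig - ll_reduct) / abs(ll_orig)


cr = compression_rate(150, 102)                 # 150 states reduced to 102
lll = log_likelihood_loss(-200.0, -210.0)       # hypothetical LL values
```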



5.1 DNA Modeling 

Genomics offers tremendous challenges and opportunities for computational
scientists. DNA sequences have various lengths and are formed from 4 symbols:
A, T, C, and G. Each symbol represents a base: Adenine, Thymine, Cytosine, and
Guanine, respectively. Recent advances in biotechnology have produced enormous
volumes of DNA-related information, which require suitable computational
techniques to manage [21].

From a machine learning point of view [22], there are three main problems to
deal with: genome annotation, including identification of genes and
classification into functional categories; computational comparative genomics,
for comparing complete genomic sequences at different levels of detail; and
genomic patterns, including identification of regular patterns in sequence
data. Hidden Markov Models are widely used in solving these problems, in
particular for classification of genes, protein family modeling, and sequence
alignment. This is because they are well suited to modeling strings (such as
DNA or protein sequences), and can provide useful measures of similarity (LL)
for comparing genes.

In this paper, we employ HMMs to model gene sequences for classification
purposes. This simple example is nevertheless significant in demonstrating the
ability of HMMs to recognize genes, even in noisy conditions (such as
biological mutations). Data were obtained by extracting a 200 bp (base pair)
fragment of the recA gene sequence of a lactobacillus. We trained 95 HMMs on
this sequence, where N (number of states) grows from 10 to 200 (step 2). We
applied the bisimulation contraction algorithm to each HMM, with the reduction
factor varying from 1 to 9 (step 2), and computed the number of resulting
states. We then compared the Log Likelihood (LL) of the original sequence
produced by the original and reduced HMMs, obtaining the results plotted in
Fig. 4(b). One can notice that the two curves are very similar, in particular
when the reduction factor is high. In Table 1, the average and maximum loss of
likelihood (LLL) are presented for each value of the reduction factor, together
with the maximum compression rate: the loss of Log Likelihood is fairly low,
and tends to decrease as the precision of bisimulation (reduction factor)
increases. This kind of analysis is performed to show the graceful evolution of
the HMM likelihood when the number of states is decreased using bisimulation.

In Fig. 4(a) the reduced number of states is plotted against the original
number of states. More precisely, for a generic value N on the abscissa, the
ordinate represents the number of states obtained after running bisimulation on
an N-state HMM. It is worth noting that the compression rate increases as the
number of states grows: this is reasonable, because small structures cannot
have a large redundancy.

Fig. 4. Compression rate (a) and comparison of the Likelihood curves for
original and reduced HMMs (b) in the DNA modeling experiment. Reduction factors
are 1, 5 and 9.

Table 1. Maximum compression rate, average and maximum Log Likelihood loss for
the DNA modeling experiment at varying reduction factors.

Reduction factor   Maximum CR (%)   Average LLL (%)   Maximum LLL (%)
1                  50.00            9.57              32.20
3                  38.16            6.69              20.07
5                  33.14            5.45              20.31
7                  34.87            5.91              18.90
9                  34.04            5.77              22.30

The second part of this experiment assesses the performance of our algorithm
in terms of classification accuracy. To perform this step we trained two
HMMs with 150 states on 200-base fragments of two different recA genes: one
from glutamicum bacillus and the other from tuberculosis bacillus. Each
HMM was then reduced using bisimulation, varying the reduction factor from 1
to 9 (step 2). Then, the HMMs were retrained with the reduced number of states,
resulting in 10 reduced HMMs (5 for each sequence). The compression rate varies
from 32% for reduction factor 1 to 22% for reduction factor 9 (see Table 2). We
tested the classification accuracy of the HMMs using 300 sequences, obtained by
adding synthetic noise to the original two. The noising procedure is the
following: each base is changed with fixed probability p (ranging from 0.3 to
0.4), following given biological rules (for example, A becomes T with a higher
probability than G). Each sequence of this set was evaluated using both models,
and classified as belonging to the class whose model showed the highest LL. The
error rate was then calculated by counting the misclassified trials and
dividing by the total number of trials. Figure 5 shows the error rate for
original and reduced HMMs, varying the probability of noise. One can notice
that the error rate trends are quite similar, and that the error is very low,
always below 5%, showing that HMMs work very well on this type of problem. In
Table 2 (a-b), average errors on original and reduced HMMs are presented,
varying noise level and reduction factor value, respectively. For the latter,
maximum compression rate and maximum LL loss are also presented. One can notice
that the difference between the two errors grows with the noise level, i.e.,
the error value becomes higher when the noise level increases, and differences
can be more significant. Nevertheless, LL losses are very low if compared with
the compression rate and the amount of noise. Actually, classification errors
remain below 5%, even in experiments with a 40% noise level. Moreover, the
error level seems to be lower in the reduced case than in the original one. A
reasonable explanation is that HMMs with fewer states generalize better, and so
also recognize sequences with higher noise, although we expect a breakdown
point beyond which the behavior of original and reduced HMMs is reversed.

Fig. 5. Error rate for different noise levels in the DNA modeling experiment.

Table 2. Error on original and reduced HMMs for the DNA modeling experiments as
a function of (a) varying noise level, and (b) varying reduction factor value.

(a)
Noise level   Error on Original (%)   Error on Reduced (%)
0.3           0.00                    0.00
0.325         0.33                    0.13
0.35          1.67                    0.80
0.375         1.67                    1.07
0.4           4.67                    2.60

(b)
Reduction factor   Average CR (%)   Average LLL (%)   Error on Original (%)   Error on Reduced (%)
1                  32.00            3.89              1.67                    1.27
3                  25.33            1.72              1.67                    0.80
5                  22.00            4.15              1.67                    0.80
7                  21.33            5.61              1.67                    0.80
9                  21.66            2.14              1.67                    0.93
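The noising and scoring procedure described above can be sketched in Python as
follows. The substitution weights are hypothetical: the text only states that
some substitutions are more likely than others, and this sketch reads that as
"A is more likely to become T than to become G". All function names are
assumptions of this example, and the log-likelihoods passed to `classify` would
come from the two trained HMMs.

```python
import random

# Hypothetical substitution weights (not taken from the paper).
SUBSTITUTION_WEIGHTS = {
    "A": {"T": 0.5, "C": 0.3, "G": 0.2},
    "T": {"A": 0.5, "C": 0.2, "G": 0.3},
    "C": {"G": 0.5, "A": 0.3, "T": 0.2},
    "G": {"C": 0.5, "A": 0.2, "T": 0.3},
}


def add_noise(sequence, p, rng):
    """Change each base with probability p, drawing the replacement by weight."""
    noisy = []
    for base in sequence:
        if rng.random() < p:
            options = SUBSTITUTION_WEIGHTS[base]
            base = rng.choices(list(options), weights=list(options.values()))[0]
        noisy.append(base)
    return "".join(noisy)


def classify(log_likelihoods):
    """Index of the model showing the highest log-likelihood."""
    return max(range(len(log_likelihoods)), key=lambda i: log_likelihoods[i])


def error_rate(predicted, actual):
    """Percentage of misclassified trials."""
    wrong = sum(1 for p, a in zip(predicted, actual) if p != a)
    return 100.0 * wrong / len(actual)
```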



5.2 2D Shape Recognition 

Object recognition, shape modeling, and classification are related issues in
computer vision. Many three-dimensional (3-D) object recognition techniques are
based on the analysis of two-dimensional (2-D) aspects (images), and several
works in the literature analyze 2-D shapes or present methods devoted to planar
object recognition.

A key issue is the kind of image feature used to describe an object, and its
representation. Object contours are widely chosen as features, and their
representation is basic to the design of shape analysis techniques. Different
types of approaches have been proposed in past years, e.g., Fourier
descriptors, chain code, curvature-based techniques, invariants,
auto-regressive coefficients, Hough-based transforms, associative memories, and
others, each one featuring different characteristics such as robustness to
noise and occlusions, invariance to translation, rotation and scale,
computational requirements, and accuracy.

Fig. 6. Toy images for 2D shape recognition using Chain Code.

In this paper, HMMs are proposed as a tool for shape classification. This
preliminary experiment aims at presenting a simple example of the capability of
HMMs to discriminate object classes, showing their robustness to partial views
and, to a lesser extent, to noise. Shape is modeled using chain code, a
well-known method for representing contours, which has some inherent
characteristics such as invariance to rotation (if local differences of the
code are considered) and translation.

Although a large literature addresses these issues, the use of HMMs for shape
analysis has not been widely addressed. To our knowledge, only the work of
He and Kundu [9] has some similarities with our approach. They use HMMs to
model shape contours represented as auto-regressive (AR) coefficients. The
results are quite interesting and are presented as a function of the number of
HMM states, ranging from 2 to 6. Moreover, shapes are constrained to be closed
contours.

In our experiment, although limited to a pair of similar objects, the degree
of occlusion is quite large, and noise has been included to affect object
coding (without heavily degrading classification performance). Due to lack of
space, we will not present results on rotational and scale invariance. Let us
only state that scale is not a problem, as the HMM structure can manage it
thanks to the possibility of remaining in the same state. Indeed, simple tests
on some differently scaled and noisy objects have confirmed that HMMs behave
correctly in this case. A detailed description of our approach with extensive
experiments is beyond the scope of this paper, and will be the subject of our
future work. In this paper, we would only like to show the capability of HMMs
to discriminate (even similar) shapes and their stable performance when the
minimal structure obtained by bisimulation is used in place of the redundant
topology.

In our experiment, given an image of 2D objects, data are gathered by assigning
to each object its chain code, calculated on the object contours. Edges are
extracted using the Canny edge detector [23], while the chain code is
calculated as described in [24]. Fig. 6 shows the two simple objects, a
stylized hammer and a screwdriver, used in the experiment. We train one HMM for
each object, varying the number of states from 4 to 20. After applying
bisimulation contraction, with the reduction factor from 1 to 9, we retrained
the HMMs with the reduced number of states and compared them in terms of Log
Likelihood. Average and maximum Log Likelihood loss are calculated, and the
results are shown in Table 3, together with the maximum compression rate for
the different reduction factor values. Average LLL values are comfortingly low:
bisimulation does not seem to affect HMM characteristics. Nevertheless, we can
also observe that the average loss is very low compared with the corresponding
maximum LLL. This is because compression is not so strong, as is evident in
Table 3, and therefore some learning sessions on the reduced HMM can produce
better results in terms of Log Likelihood. The LL of an HMM on a sequence
typically grows with N. On the other hand, LL depends on how well the training
algorithm worked on the data. Baum-Welch re-estimation is only guaranteed to
reach the nearest local optimum, without any information about the global
optimum. So, it is possible that for close $N_1$, $N_2$, with $N_1 < N_2$, an
HMM with $N_1$ states shows a larger LL than one with $N_2$ states, because the
training algorithm worked better. To partially solve the problem of
convergence, each HMM was trained three times, starting from different random
initial conditions. The cases of very high LL loss may be explained by a low
compression rate (the HMMs have a similar number of states) combined with very
bad training (in this case three trials seem to be insufficient to ensure
correct learning).

Table 3. Maximum compression rate, average and maximum Log Likelihood loss for
the 2D shape recognition test, at varying reduction factor.

Reduction factor   Maximum CR (%)   Average LLL (%)   Maximum LLL (%)
1                  16.74            4.91              72.20
3                  9.43             0.55              29.39
5                  6.33             2.30              72.26
7                  6.40             1.72              72.85
9                  4.71             1.28              68.73

For testing classification accuracy, we synthetically create two test sets. The
first set is obtained by considering, for each object, fragments of its chain
code of variable length, expressed as a percentage of the whole length. The
length varies from 20 to 90 percent, and the point where the fragment starts is
chosen randomly. The second set is obtained by adding synthetic noise to the
two chain codes, using a procedure similar to that used for DNA. Each code is
changed with fixed probability P, i.e., if $cc_i$ is the original code, with
probability P it is replaced by $(((cc_i - 1) \pm 1) \bmod 8) + 1$. The
probability ranges from 0.05 to 0.35, and, for each value, 60 sequences are
generated. As usual, a sequence is assigned to the class whose model shows the
highest Log Likelihood, and the error rate is estimated by counting the
misclassified patterns. For each of the two test sets, we calculate performance
using original and reduced HMMs, varying the reduction factor from 1 to 9. In
Table 4, the average errors for original and reduced HMMs on the set of
fragments are presented, varying the reduction factor from 1 to 9. We can see
that the difference between the two errors is very low.
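The construction of the two test sets can be sketched in Python as below. The
function names are hypothetical, and two details are assumptions of this
example rather than statements of the paper: the sign in the perturbation
formula is drawn uniformly, and fragments do not wrap around the end of the
contour.

```python
import random


def perturb(code, step):
    """Apply (((cc - 1) + step) mod 8) + 1 to a chain code in 1..8."""
    return ((code - 1 + step) % 8) + 1


def add_chain_code_noise(codes, p, rng):
    """Shift each code by +1 or -1 (uniformly, by assumption) with probability p."""
    return [perturb(c, rng.choice((-1, 1))) if rng.random() < p else c
            for c in codes]


def random_fragment(codes, fraction, rng):
    """Contiguous fragment covering the given fraction, random start, no wrap."""
    length = max(1, round(fraction * len(codes)))
    start = rng.randrange(len(codes) - length + 1)
    return codes[start:start + length]
```

Note that the modular form keeps the result in the range 1..8: code 1 shifted
down becomes 8, and code 8 shifted up becomes 1.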

The same results are presented in Table 5 for the set of noisy sequences,
varying reduction factor (Table 5(a)) and noise level (Table 5(b)).

Table 4. Error on original and reduced HMMs for the 2D shape recognition
experiment (fragments set): (a) varying reduction factor; (b) varying fragment
length.

(a)
Reduction factor   Error on Original (%)   Error on Reduced (%)
1                  2.52                    2.91
3                  2.52                    2.19
5                  2.52                    1.51
7                  2.52                    0.44
9                  2.52                    2.70

(b)
Fragment length (%)   Error on Original (%)   Error on Reduced (%)
20                    4.50                    4.33
30                    3.60                    3.28
40                    2.77                    2.32
50                    3.23                    2.31
60                    3.23                    1.75
70                    2.83                    1.36
80                    0.00                    0.23
90                    0.00                    0.01

Table 5. Error on original and reduced HMMs for the 2D shape recognition
experiment (noised set): (a) varying reduction factor; (b) varying noise level.

(a)
Reduction factor   Error on Original (%)   Error on Reduced (%)
1                  29.08                   24.83
3                  29.08                   29.05
5                  29.08                   21.14
7                  29.08                   28.23
9                  29.08                   25.97

(b)
Noise level (%)   Error on Original (%)   Error on Reduced (%)
5                 11.33                   9.64
10                20.5                    17.24
15                27.11                   23.61
20                31.67                   28.21
25                35.24                   31.70
30                37.78                   34.16
35                39.95                   36.33

A consideration can be made on the performance of HMMs applied to this
problem: the average error in recognizing the fragment sequences is 1.21%, a
very low value. This means that a simple HMM can be robust to some types of
object occlusion. Noise, on the other hand, seems to be a more serious problem,
although working on topology and training algorithms may make classification
accuracy less affected by it.

Another point regards the similarity of the two objects, which may seriously
affect performance. Using very different objects, this problem may be
attenuated. More extensive tests on invariance to scale and rotation should be
carried out to better evaluate HMM performance for shape classification.



5.3 Comparison with Other Methods 

Regarding the model selection approaches present in the literature and listed
in Section 1, an interesting comparative evaluation is presented in [25]. In
that paper, a comparison between MDL/BIC, EBB and MDL for Gaussian mixture
models is reported, showing comparable performances and proving their
superiority with respect to other methods. For convenience, we choose the BIC
method [26] for our comparative analysis. BIC is a likelihood criterion
penalized by the model complexity, i.e., in our case, the number of HMM states.
Let $X = \{x_i, i = 1, \ldots, N\}$ be the data set we are modeling and
$M = \{M_i, i = 1, \ldots, K\}$ be the candidate models. Let us denote by
$|M_i|$ the number of parameters of the model $M_i$. Assuming that the
likelihood function $\mathcal{L}(X, M_i)$ is maximized for each possible model
structure $M_i$, the BIC criterion is defined as:

$$BIC(M_i) = \log \mathcal{L}(X, M_i) - \frac{1}{2}\,|M_i|\,\log(N)$$

This strategy selects the model for which the BIC criterion is maximized.
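The selection rule can be illustrated with the Python sketch below. The
function names, the candidate list layout and the numerical values are
invented for this example; in practice the log-likelihoods would come from the
trained HMMs and the parameter counts from their structure.

```python
import math


def bic(log_likelihood, n_params, n_samples):
    """Log-likelihood penalized by half the parameter count times log(N)."""
    return log_likelihood - 0.5 * n_params * math.log(n_samples)


def select_model(candidates, n_samples):
    """candidates: list of (name, log_likelihood, n_params); returns best name."""
    return max(candidates, key=lambda c: bic(c[1], c[2], n_samples))[0]


# Hypothetical candidates: the larger model fits slightly better but pays a
# much larger complexity penalty.
candidates = [("12 states", -500.0, 180), ("20 states", -480.0, 460)]
chosen = select_model(candidates, n_samples=300)
```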

We compare our strategy with this approach on the 2D shape experiment. We
train 18 HMMs, with the number of states varying from 3 to 20, and for each
model we compute the BIC value. The curves of BIC vs. number of states are
plotted in Fig. 7 for the two objects. We then choose the HMM showing the
highest BIC value (corresponding to 12 and 14 states for the screwdriver and
the hammer, respectively).

Fig. 7. BIC value vs. number of states for the 2D shape recognition experiment
with (a) the hammer and (b) the screwdriver.

With our bisimulation approach we train one HMM with 20 states, apply
bisimulation, and train another HMM with the calculated number of states,
varying the reduction factor from 1 to 9 (step 2). To compare the two methods
we create a test set by adding synthetic noise (of various levels) to the two
chain codes, in a way similar to that presented in the previous section,
obtaining, for each noise level, 120 sequences to be classified. We then
calculate the classification error for the two approaches, presenting the
results in Table 6 as a function of the noise level. We can notice that, on
average, the classification accuracy is quite similar; however, the BIC method
needs 18 training sessions, while our method needs only two, plus the time for
determining the bisimulation contraction (that is, $O(MKN \log N)$, given an
HMM with N states, K edges and M symbols). In problems with a short alphabet
(such as the DNA modeling and chain code problems), our method is definitely
faster than BIC, while giving approximately the same classification accuracy.

Table 6. Comparison between BIC method and our approach.

              States             Classification Error (%) at noise level
Method        Screw.   Hammer    0.05    0.10    0.15    0.20    0.25    0.30    0.35
BIC           12       14        13.33   25.00   32.50   36.88   39.83   41.81   42.98
Bisim RE 1    14       15        20.00   30.00   33.89   36.25   42.67   44.44   48.57
Bisim RE 3    15       16        21.67   34.17   39.44   42.08   43.67   44.72   45.48
Bisim RE 5    18       18        10.00   10.83   22.78   29.17   34.67   40.28   35.71
Bisim RE 7    17       19        28.33   38.33   42.22   44.17   45.33   46.11   46.67
Bisim RE 9    20       20        31.67   37.50   37.22   37.08   37.00   34.17   30.24

6 Conclusions 

In this paper, probabilistic bisimulation is used to estimate the minimal
structure of an HMM. It has been shown that, starting from a redundant
configuration, bisimulation allows equivalent states to be merged while
preserving classification performance. Redundant and minimal HMM architectures
have been tested on two different cases, DNA modeling and 2D shape
classification, showing the usefulness of the approach. Moreover, our method
has been compared with a classic model selection scheme, showing comparable
performance at a lower computational cost.



Abstract. This paper introduces R-SMW, a new algorithm for stereo
matching. The main aspect is the introduction of a Markov Random
Field (MRF) model in the Symmetric Multiple Windows (SMW) stereo
algorithm in order to obtain a non-deterministic relaxation. The SMW
algorithm is an adaptive, multiple window scheme using left-right
consistency to compute disparity. The MRF approach makes it possible to
combine in a single functional the disparity values coming from different
windows, the left-right consistency constraint and regularization hypotheses.
The optimal estimate of the disparity is obtained by minimizing an energy
functional with simulated annealing. Results with both synthetic and real
stereo pairs demonstrate the improvement over the original SMW algorithm,
which was already proven to perform better than state-of-the-art algorithms.



1 Introduction 

Three-dimensional (3D) reconstruction is a fundamental issue in Computer Vi- 
sion, and in this context, structure from stereo algorithms play a major role. 
The process of stereo reconstruction aims at recovering the 3D scene structure 
from a pair of images by searching for conjugate points, i.e., points in the left 
and right images that are projections of the same scene point. The difference 
between the positions of conjugate points is called disparity. 

Stereo is a well-known issue in Computer Vision, to which many articles
have been devoted (see [4] for a survey). In particular, the search for
conjugate points in the two images is one of the main problems, and several
techniques have been proposed to make this task more reliable. The search is
based on a matching process that estimates the "similarity" of points in the
two images on the basis of local or punctual information. Feature-based methods
try to associate image features (e.g., corners) in the image pair, whereas
area-based methods try to find point correspondences by comparing local
information. The main difference lies in the fact that the former approach
produces a sparse disparity map (i.e., defined only at features), whereas the
latter generates a dense disparity image. In particular, the area-based
matching process is typically based on a similarity measure computed between
small windows in the two images (left and right). The match is normally found
in a deterministic way, at the highest similarity value.

In this paper, a novel probabilistic stereo method is proposed, which is based
on the Symmetric Multi-Window (SMW) algorithm presented in [7,8]. In this
algorithm, matching is performed by correlation between different kinds of
windows in the two images, and by enforcing the so-called left-right
consistency constraint. This imposes the uniqueness of the conjugate pair,
i.e., each point in one image can match at most one point in the other image.
This algorithm is fully deterministic and performs well compared with the state
of the art, namely [15,23]. Still, there is a margin for improvement (as our
results show) by implementing a probabilistic SMW using Markov Random Fields
(MRFs). To this end, SMW has been re-designed in a probabilistic framework
by defining a random field evolving according to a suitable energy function.

The literature on MRFs is large and covers many topics of image processing,
such as restoration, segmentation, and image reconstruction, considering both
intensity (video) [3] and range images [1]. In addition, different approaches,
closer in spirit to ours, have been proposed that integrate additional
information in the MRF model, for instance edges to guide the line extraction
process [10], or confidence data to guide the reconstruction of underwater
acoustic images [19].

More specifically, MRFs are also frequently used in computer vision applica- 
tions in the context of stereo or motion. For example, in [14] the motion vector is 
computed adopting a stochastic relaxation criterion, or, in [16], MRFs are used 
to detect occlusions in image sequences. Stereo disparity estimation methods 
using MRFs have been proposed in several papers. In [21], a stereo matching 
algorithm is presented as a regularized optimization problem. Only a simple cor- 
respondence between single pixels in the image pair is considered that leads to 
an acceptable photometric error, without exploiting local information for dispar- 
ity estimation. In [17], matching is based on a fixed single correlation window, 
shifted along raster scan lines, and disparity is calculated by integrating gradi- 
ent information in both left and right images. In this way, a mix between area- 
and feature-based matching is obtained, but occlusions are not managed by the 
algorithm. In [22], an MRF model was designed to take into account occlusions 
defining a specific dual field, so that they can be estimated in a similar way as 
in a classic line process similar to that presented in [12]. 

In our approach, for each pixel, different windows are (ideally) considered to 
estimate the Sum of Squared Differences (SSD) values and related disparities. 
When using area-based matching, the disparity is correct only if the area covered 
by the matching window has constant depth. The idea is that a window yielding 
a smaller SSD error is more likely to cover a constant depth region; in this way, 
the disparity profile itself drives the selection of an appropriate window. 

The main contribution of this paper lies in the definition of an MRF model for
a non-deterministic implementation of the SMW algorithm, in order to consider
in a probabilistic way the contributions associated with the several windows,
while also introducing a local smoothness criterion. The initial disparity is
computed for each window with SSD matching, then the Winner-Take-All approach
of the SMW algorithm is relaxed by exploiting the MRF optimization.

The rest of the paper is organized as follows. In Section 2, the stereo process
is described, and the basic MRF concepts are reported in Section 3. The actual
MRF model is detailed in Section 4, and results are presented in Section 5.
Finally, in Section 6, conclusions are drawn.



2 The Stereo Process 

Many algorithms for disparity computation assume that conjugate pairs lie along 
raster lines. In general this is not true, therefore stereo pairs need to be rectified 
- after appropriate camera calibration - to achieve epipolar lines parallel and 
horizontal in each image [9]. 

A customary assumption, moreover, is that the image intensity of a 3D point
is the same in the two images. If this is not true, the images must be
normalized. This can be done by a simple algorithm [2] which computes the
parameters of the gray-level transformation

$$I_l(x,y) = \alpha I_r(x,y) + \beta \quad \forall (x,y)$$

by fitting a straight line to the plot of the left cumulative histogram versus
the right cumulative histogram.

The matching process consists in finding the element (a point, region, or
generic feature) in the right image which is most similar, according to a
similarity metric, to a given element in the left image.

In the simple correlation stereo matching, similarity scores are computed, 
for each pixel in the left image, by comparing a fixed window centered on the 
pixel, with a window in the right image, shifting along the raster line. It is 
customary to use the Euclidean distance, or Sum of Squared Differences (SSD), 
as a (dis)similarity measure. The computed disparity is the one that minimizes 
the SSD error. 
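A minimal Python sketch of this fixed-window SSD search is given below. It
assumes rectified images stored as lists of rows, and the convention that a
left pixel at column x matches the right pixel at column x - d; the function
names and the synthetic image pair are illustrative assumptions.

```python
import random


def ssd(left, right, x, y, d, half):
    """Sum of squared differences between two (2*half+1)^2 windows."""
    total = 0
    for dy in range(-half, half + 1):
        for dx in range(-half, half + 1):
            diff = left[y + dy][x + dx] - right[y + dy][x + dx - d]
            total += diff * diff
    return total


def best_disparity(left, right, x, y, max_d, half):
    """Disparity in 0..max_d that minimizes the SSD error along the raster line."""
    candidates = [d for d in range(max_d + 1) if x - d - half >= 0]
    return min(candidates, key=lambda d: ssd(left, right, x, y, d, half))


# Synthetic rectified pair: the left image is the right one shifted by 2 pixels.
rng = random.Random(0)
right = [[rng.randrange(256) for _ in range(12)] for _ in range(6)]
left = [[row[x - 2] if x >= 2 else 0 for x in range(12)] for row in right]
d_hat = best_disparity(left, right, x=6, y=3, max_d=4, half=1)
```

The caller is expected to pick pixels far enough from the image border for the
window to fit; the sketch only guards the left border of the right image.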

Even under simplified conditions, it appears that the choice of the window size
is critical. A window that is too small is noise-sensitive, whereas an
exceedingly large one acts as a low-pass filter, and is likely to miss depth
discontinuities. This problem is addressed effectively - although not
efficiently - by the Adaptive Window algorithm [15], and by the simplified
version of the multiple window approach, introduced by [13,11].

Several factors make the correspondence problem difficult. A major source
of errors in computational stereo is occlusions, although they help the human
visual system in detecting object boundaries. Occlusions create points that do
not belong to any conjugate pair (Figure 1). There are two key observations
for addressing the occlusion issue: (i) matching is not a symmetric process:
when searching for corresponding elements, only the visible points in the
reference image (usually, the left image) are matched; (ii) in many real cases
a disparity discontinuity in one image corresponds to an occlusion in the other
image. Some authors [6,5] use observation (i) to validate matching (left-right
consistency); others [11,2] use (ii) to constrain the search space.

Fig. 1. Left-right consistency in the case of the random-dot stereogram of
Figure 3 (left image and right image). Point B is given C' as a match, but C'
matches $C \neq B$.

Recently, a new algorithm has been proposed [8] that computes disparity 
by exploiting both the multiple window approach and the left-right consistency 
constraint. For each pixel, SSD matching is performed with nine 7x7 windows 
with different centers: the disparity with the smallest SSD error value is retained. 

The idea is that a window yielding a smaller SSD error is more likely to 
cover a constant depth region. Consider the case of a piecewise-constant surface: 
points within a window covering a surface discontinuity come from two different 
planes, therefore a single “average” disparity cannot be assigned to the whole 
window without making a manifest error. The multiple windows approach can 
be regarded as a robust technique able to fit a constant disparity model to data 
drawn from a piecewise-constant surface, that is, capable of discriminating between 
two different populations. Occlusions are also detected, by checking the left-right 
consistency and suppressing infeasible matches accordingly. 

In this work we introduce a relaxation of the SMW algorithm using an MRF. 
Both the multiple windows and the left-right consistency constraint are 
kept but, in some sense, they are relaxed. 

As for the multiple windows, one may note that correlation needs to be 
computed only once, because an off-centered window for a pixel is the on-centered 
window for another pixel. Therefore, the multiple windows technique used by 
SMW reduces to assigning a given pixel the disparity computed for one of its 
neighbours with an on-centered window, namely, the neighbour with the smallest 
SSD error. The scheme with nine 7x7 windows gives rise to a sparse neighbourhood 
of nine pixels. The idea is to relax this scheme, and consider the pixel with the 
smallest SSD error in a full n x n neighbourhood (Figure 2). 
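
The relaxed scheme can be illustrated with a short sketch. It assumes that a disparity map `disp` and the matching SSD map `ssd` have already been computed with an on-centered window; the function name `relaxed_multiwindow` is hypothetical. Note that this is only the deterministic selection rule, not the MRF formulation developed in the following sections.

```python
import numpy as np

def relaxed_multiwindow(disp, ssd, n=7):
    """Give each pixel the disparity of the neighbour with the smallest SSD.

    The neighbourhood is the full n x n block centred on the pixel
    (truncated at the image border). Ties are resolved in favour of the
    first minimum in row-major order, which is an arbitrary choice.
    """
    h, w = disp.shape
    r = n // 2
    out_disp = disp.copy()
    out_ssd = ssd.copy()
    for y in range(h):
        for x in range(w):
            y0, y1 = max(0, y - r), min(h, y + r + 1)
            x0, x1 = max(0, x - r), min(w, x + r + 1)
            patch = ssd[y0:y1, x0:x1]
            j = np.unravel_index(np.argmin(patch), patch.shape)
            out_disp[y, x] = disp[y0 + j[0], x0 + j[1]]
            out_ssd[y, x] = patch[j]
    return out_disp, out_ssd
```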

As for the left-right consistency, we noticed that in the presence of a large amount 
of noise or figural distortion, the left-right consistency check could fail for true 
conjugate pairs, and points could be wrongly marked as occluded. In this respect, it 
would be useful to relax the constraint, allowing for small errors. 

Fig. 2. The 49 asymmetric correlation windows for a 7 x 7 neighbourhood. The pixel for 
which disparity is computed is highlighted. The off-centered window for the highlighted 
pixel is the on-centered window for another pixel. 



All these requirements can be cast in terms of cost functions to be minimized over 
a disparity field, and this points strongly to Markov Random Fields. 

3 Markov Random Fields 

In this section we shall briefly review some basic concepts regarding MRFs, in 
order to introduce our notation. 

An MRF is defined on a finite lattice $I$ of elements i called sites (in 
our case, image pixels). Let us define a family of random variables 
$D = \{D_i = d_i,\ i \in I\}$, and let us suppose that each variable may assume values taken from 
a discrete and finite set (e.g., the set of grey levels). The image is interpreted as a 
realization of the discrete stochastic process in which pixel i is associated with a 
random variable $D_i$ ($d_i$ being its realization), and where, owing to the Markov 
property, the conditional probability $P(d_i \mid d_{I \setminus \{i\}})$ depends only on the values 
on the neighboring set $N_i$ of i (see [12]). 

The Hammersley-Clifford theorem establishes the Markov-Gibbs equivalence 
between MRFs and Gibbs Random Fields [18], so the probability distribution 
takes the following form: 

$$P(d) = Z^{-1} e^{-U(d)/T} \qquad (1)$$

where $Z$ is a normalization factor called the partition function, $T$ is a parameter 
called temperature, and $U(d)$ is the energy function, which can be written as a 
sum of local energy potentials dependent only on the cliques $c \in C$ (local 
configurations) relative to the neighboring system [12]: 

$$U(d) = \sum_{c \in C} V_c(d) \qquad (2)$$

In general, given the observation g, the posterior probability $P(d \mid g)$ can be 
derived from the Bayes rule by using the a-priori probability $P(d)$ and the 
conditional probability $P(g \mid d)$. The problem is solved by computing the estimate $\hat{d}$ 
according to a Maximum A-Posteriori (MAP) probability criterion. Since the 
posterior probability is still of the Gibbs type, we have to minimize $U(d \mid g) 
= U(g \mid d) + U(d)$, where $U(g \mid d)$ is the observation model and $U(d)$ is the 
a-priori model [12]. Minimization of the functional $U(d \mid g)$ is performed by a 
simulated annealing algorithm with a Metropolis sampler [20,18]. 
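
To make the minimization step concrete, here is a minimal sketch of simulated annealing with a Metropolis sampler on a toy one-dimensional label chain. Everything specific in it is an assumption for illustration: the energy (a unit cost when a label differs from its observation, plus `beta` for each pair of unequal neighbours), the geometric cooling schedule and the function names. It is not the energy functional of this paper.

```python
import math
import random

def site_energy(labels, obs, i, value, beta):
    """Energy terms involving site i if it took `value` (toy 1-D model)."""
    e = 0.0 if value == obs[i] else 1.0           # assumed observation term
    for j in (i - 1, i + 1):                      # two-site cliques of the chain
        if 0 <= j < len(labels) and labels[j] != value:
            e += beta                             # assumed piecewise-constant prior
    return e

def total_energy(labels, obs, beta=1.0):
    data = sum(1.0 for l, o in zip(labels, obs) if l != o)
    prior = sum(beta for a, b in zip(labels, labels[1:]) if a != b)
    return data + prior

def anneal(obs, n_labels, beta=1.0, t0=2.0, cooling=0.95, sweeps=200, seed=0):
    """Metropolis sampling at a temperature that decreases geometrically."""
    rng = random.Random(seed)
    labels = list(obs)
    t = t0
    for _ in range(sweeps):
        for i in range(len(labels)):
            proposal = rng.randrange(n_labels)
            delta = (site_energy(labels, obs, i, proposal, beta)
                     - site_energy(labels, obs, i, labels[i], beta))
            # Always accept downhill moves; accept uphill moves with
            # probability exp(-delta / T).
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                labels[i] = proposal
        t *= cooling
    return labels
```

Because only the terms involving site i change when its label changes, the difference of the two `site_energy` values equals the change in `total_energy`, which keeps each update local.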

When the MRF model is applied to image processing, the observation model 
describes the noise that degrades the image and the a-priori model describes the 
a-priori information independent of the observations, such as, for instance, the 
smoothness of the surfaces composing the scene objects. 



4 Model Description 

To deal with the stereo problem, the scene is modeled as composed of a set 
of planes located at different distances from the observer, so that each disparity 
value corresponds to a plane in the scene. Therefore, the a-priori model is piecewise 
constant [18]. The observation model is harder to define because the disparity 
map is not produced by a process for which a noise model can be devised. 

Whilst the a-priori model imposes a smoothness constraint on the solution, 
the observation model should describe how the observations are used to produce 
the solution. In fact, it is the observation term that encodes the multiple win- 
dows heuristic. For each site, we take into account all the disparity values in 
the neighbourhood, favouring the ones with the smallest SSD. 

First a disparity map is computed using the simple SSD matching algorithm 
outlined in Sec. 2, taking in turn the left and the right images as the reference 
one. This produces two disparity maps, which we will call left and right, respec- 
tively. The left-right consistency constraint is implemented by coupling the 
left disparity and the right disparity values. 

In order to define the MRF model, we introduce two random fields $D^l$ and 
$D^r$ to estimate the left and the right disparity map, two random fields $G^l$ and 
$G^r$ to model the left and the right observed disparity map, and two random fields 
$S^l$ and $S^r$ to model the SSD error. The field $D^l$ (or equivalently $D^r$) will yield 
the output disparity. 

In the following we shall describe the MRF functional, by defining the 
a-priori model, the observation model, and the left-right consistency constraint 
term. In the next two subsections, we shall omit the superscripts l and r in the field 
variables. It is understood that the a-priori model and the observation model 
apply to both left and right fields. 



4.1 A-priori Model 

With the a-priori term we encode the hypothesis that the surfaces in the scene 
are locally flat. Indeed, we employ a piecewise constant model, defined as: 

$$U(d) = \sum_{i \in I} \sum_{j \in N_i} [\ldots] \qquad (3)$$

where $d_i$ and $d_j$ are the estimated disparity values (the realization of the field D) 
and the function $\delta(x,y)$ is defined as: 

$$\delta(x,y) = \begin{cases} 1 & \text{if } x = y \\ 0 & \text{otherwise} \end{cases} \qquad (4)$$

This term introduces a regularization constraint, imposing that all pixels 
assume the same value in a region, thereby smoothing out isolated spikes. 



4.2 Observation Model 

In order to mimic the behaviour of the SMW algorithm, the observation model 
term introduces a local non-isotropic relaxation, favouring the neighbour obser- 
vations with the lower SSD value: ^ 




where g is the observed disparity map (the realization of the field G), s denotes 
the observed SSD values (the realization of the field S), and d is the disparity 
estimate (the realization of the field D). Following [18], the term $1 - \delta(x,y)$ 
represents a generalization of the sensor model for binary surfaces (derived from 
binary symmetric channel theory). 

In this term, the estimated value at site i, $d_i$, is compared with all its observed 
neighbours $\{g_j\}_{j \in N_i}$ and with $g_i$. When $d_i$ takes the disparity of one (or more) 
of its neighbours, one (or more) term(s) in the sum vanish. The lower the 
SSD error of the chosen disparity, the higher the cost reduction. 

Please note that we are not relating the SSD value to the matching likelihood: 
the rationale behind this energy term is the multiple windows idea. 



4.3 Left-Right Consistency Constraint Term 

In our MRF model, besides an observation term and an a-priori term for both 
left and right disparity fields, we define a coupling term that settles the left-right 
constraint. 

Let $d^l_i$ be the left disparity (i.e., the disparity computed taking the left image 
as the reference) at site i, and $d^r_i$ the right disparity at site i. The left-right 
consistency constraint states that: 

$$d^l_i = -d^r_{i + d^l_i} \qquad (6)$$

The corresponding energy term is: 

$$V(d^l, d^r) = \sum_{i \in I} \theta(d^l_i + d^r_{i + d^l_i}) \qquad (7)$$

where $\theta(x)$ is defined as: 

$$\theta(x) = \begin{cases} 0 & \text{if } x = 0 \\ 1 & \text{otherwise} \end{cases} \qquad (8)$$

^ Please note that SSD is an error measure, not a correlation or similarity measure. 

In this way we introduce a penalty when the left-right constraint is violated. 
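
The coupling term can be sketched as follows for integer disparity maps stored as 2-D arrays, with the site index i running along each raster line. The function name `left_right_energy` is hypothetical, and counting matches that fall outside the image as violations is an assumption of this sketch, since the text does not say how borders are handled.

```python
import numpy as np

def left_right_energy(d_left, d_right):
    """Count the sites where d^l_i + d^r_{i + d^l_i} is non-zero.

    d_left, d_right: integer arrays of shape (rows, cols).
    Returns (energy, boolean mask of violating sites).
    """
    h, w = d_left.shape
    cols = np.arange(w)[None, :] + d_left          # column i + d^l_i in the right map
    inside = (cols >= 0) & (cols < w)
    rows = np.broadcast_to(np.arange(h)[:, None], (h, w))
    matched = d_right[rows, np.clip(cols, 0, w - 1)]
    # Assumption: a match that leaves the image counts as a violation.
    violated = ((d_left + matched) != 0) | ~inside
    return int(violated.sum()), violated
```

The returned mask is the same test that can be used afterwards to flag occluded points.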



4.4 Final Model 

The final MRF energy reads: 

$$U(d^l, d^r \mid g^l, s^l, g^r, s^r) = k_1 \left[ U(g^l, s^l \mid d^l) + U(g^r, s^r \mid d^r) \right] + k_2 \left[ U(d^l) + U(d^r) \right] + k_3 V(d^l, d^r) \qquad (9)$$

where $U(g^l, s^l \mid d^l)$ and $U(g^r, s^r \mid d^r)$ are the observation models applied to the 
left and right disparity reconstruction, $U(d^l)$ and $U(d^r)$ are the a-priori models, and 
$V(d^l, d^r)$ is the left-right constraint term. Please note that these terms are all 
weighted by the coefficients $k_1, k_2, k_3$, which are chosen heuristically. 

This model performs the left and right disparity reconstructions simultaneously, 
and the two estimates influence each other in a cooperative way. We 
call our algorithm Relaxed SMW (R-SMW). 

5 Results 

This section reports the main results of the experimental evaluation of our 
algorithm. Numerical and visual comparisons with other algorithms are shown. 

We first performed experiments on noise-free random-dot stereograms (RDS), 
shown in Figure 3. In the disparity maps, the gray level encodes the disparity, 
that is, the depth (the brighter the closer). Images have been equalized to improve 
readability. 

Fig. 3. Random-dot stereograms (RDS). The right image of the stereogram (not shown 
here) is computed by warping the left one (a), which is a random texture, according 
to a given disparity pattern: the square has a disparity of 10 pixels, the background 3 pixels. 
Disparity map computed with SSD matching (b), SSD values (c), and R-SMW disparity 
map (d). 

Fig. 4. “Head” stereo pair (a), (b) from the Multiview Image Database, University of 
Tsukuba. 

Fig. 5. R-SMW disparity map (a) and ground truth (b). 

Figure 3.b shows a disparity map computed by SSD matching with a fixed 7x7 
window. The negative effect of disparity jumps (near the borders of the square 
patch) is clearly visible. Accordingly, Figure 3.c shows that along the borders 
the SSD error is higher. R-SMW yields the correct disparity map, shown in 
Figure 3.d. This replicates exactly the result obtained by SMW [7]. We expect 
the improvement brought by the MRF to become apparent when noise affects the images, 
as in real cases. 

As a real example, we used the “Head” stereo pair shown in Figure 4, for 
which the disparity ground truth is given. The output of R-SMW is reported 
in Figure 5. The error image (Figure 6.a) shows in black the pixels where the 
disparity is different from the ground truth. For the others, the disparity value 
is shown as a gray level (the brighter the closer). 

The R-SMW algorithm outputs two optimal disparity maps, the left and 
the right. By checking the left-right consistency against these two maps, one can 
detect occluded points. Figure 6.b shows the occlusions map. It is worth noting 
that most of the wrong pixels of Figure 6.a come from occlusions. 

Fig. 6. Error map (a): the wrong pixels are black, the others take the correct disparity 
value. Occlusions map (b). 

Fig. 7. Disparity map calculated by the Zabih and Woodfill algorithm (a) and by the 
SMW algorithm (b). 

We compared R-SMW with SMW and our implementation of the Zabih and 
Woodfill algorithm (Figure 7). In the R-SMW result, the regions are more 
homogeneous and the edges are better defined. Moreover, many spurious points are 
removed. Please note that there is a large wrong area, corresponding to a box on 
the shelf (top right). The ground truth disparity for that box is the same as the 
background, but all the stereo algorithms we tested agree in detecting a different 
disparity for the box. We suspect that the ground truth could be incorrect for 
that area. 

Fig. 8. “Castle” left image (a), SMW disparity (b), and R-SMW disparity (c). 

Fig. 9. “Parking meter” left image (a), SMW disparity (b), and R-SMW disparity (c). 

For quantitative comparison, two error measures have been used: the Mean 
Absolute Error (MAE), i.e., the mean of absolute differences between estimated 
and ground-truth disparities, and the Percentage Error (PE), i.e., the percentage 
of pixels labeled with a wrong disparity. Table 1 reports the MAE and PE errors 
obtained on the “Head” stereo pair by SSD matching, the Zabih and Woodfill 
algorithm, SMW, and R-SMW. Results are ordered by decreasing MAE: R-SMW 
exhibits the best performance, improving over SMW. The Zabih and Woodfill 
algorithm is very fast (it reaches 30 frames per second on dedicated hardware), 
but results suggest that accuracy is not its best feature. We did not compare 
R-SMW with the Adaptive Window (AW) algorithm by Okutomi and Kanade 
[15], but in [7] SMW was already shown to be more accurate than AW. 
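
The two error measures can be written down directly. The sketch below assumes that the estimated and ground-truth disparity maps are arrays of the same shape and that every pixel takes part in the evaluation; whether occluded or border pixels were excluded in the experiments is not stated, so no masking is applied here. The function name `mae_and_pe` is hypothetical.

```python
import numpy as np

def mae_and_pe(estimated, ground_truth):
    """Return (Mean Absolute Error, Percentage Error) of a disparity map."""
    est = np.asarray(estimated, dtype=float)
    gt = np.asarray(ground_truth, dtype=float)
    mae = np.abs(est - gt).mean()          # mean absolute disparity difference
    pe = 100.0 * np.mean(est != gt)        # percentage of wrongly labelled pixels
    return mae, pe
```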



Table 1. Error values obtained by the different disparity reconstruction algorithms. 

Algorithm                MAE      PE 
SSD with fixed window    0.8569   30.73% 
Zabih and Woodfill       0.8012   31.56% 
SMW                      0.6194   24.98% 
R-SMW                    0.5498   21.33% 

Fig. 10. “Shrub” left image (a), SMW disparity (b), and R-SMW disparity (c). 

Fig. 11. “Trees” left image (a), SMW disparity (b), and R-SMW disparity (c). 


We also report in Figures 8, 9, 10, and 12 the results of our algorithm on 
standard image pairs from the JISCT (JPL-INRIA-SRI-CMU-TELEOS) stereo test 
set, and from the CMU-CIL (Carnegie-Mellon University, Calibrated Imaging 
Laboratory). Although a quantitative evaluation was not possible on these 
images, the quality of our results seems to improve over SMW, especially because 
spurious points and artifacts are smoothed out. 



6 Conclusions 

We have introduced a relaxation of the SMW algorithm by designing a Markov 
Random Field where both the multiple windows scheme and the left-right 
consistency are embedded. Moreover, a regularization constraint is introduced, as 
customary, to bias the solution toward piecewise constant disparities. Thanks to 
the versatility of MRFs, all these constraints are easily expressed in terms of energy, 
and the final solution benefits from the trade-off between a-priori model and 
observations. Indeed, results showed that R-SMW performs better than 
state-of-the-art algorithms, namely SMW, Zabih and Woodfill and (indirectly) AW. 

Fig. 12. “Trees” left image (a), SMW disparity (b), and R-SMW disparity (c). 

A sequential implementation of an MRF is computationally intensive, of course: 
our algorithm took several minutes to converge to a solution on the examples 
shown in the previous section. On the other hand, it reaches a high accuracy, and 
in some applications (e.g., model acquisition), one might want to trade time 
for accuracy. 

A drawback of R-SMW is that the coefficients in the energy functional need to 
be adjusted heuristically, and their values are fairly critical for the overall quality 
of the result. Work is in progress to implement a procedure for selecting the best 
coefficients automatically. 

Abstract. In a previous paper we described a system which recursively 
recovers a super-resolved three dimensional surface model from a set of 
images of the surface. In that paper we assumed that the camera cali- 
bration for each image was known. In this paper we solve two problems. 
Firstly, if an estimate of the surface is already known, the problem is to 
calibrate a new image relative to the existing surface model. Secondly, if 
no surface estimate is available, the relative camera calibration between 
the images in the set must be estimated. This will allow an initial surface 
model to be estimated. Results of both types of estimation are given. 



1 Introduction 

In this paper we discuss the problem of camera calibration, estimating the posi- 
tion and orientation of the camera that recorded a particular image. This can be 
viewed as a parameter estimation problem, where the parameters are the camera 
position and orientation. We present two methods of camera calibration, based 
on two different views of the problem. 

1. Using the entire image $I$, the parameters are estimated by minimizing 
$(I - \hat{I}(\Theta))^2$, where $\Theta$ are the camera parameters and $\hat{I}(\Theta)$ is the image 
simulated from the (known) surface model. 

2. Using features extracted from the image, the parameters are estimated by 
minimizing $(u - \hat{u}(\Theta))^2$, where $\hat{u}(\Theta)$ is the position of the estimated feature 
projected into the image plane. 

Under the assumption that a surface model is known, the first method has a 
number of advantages. It makes no assumptions about the size of the displace- 
ments between the images; it gives much more accurate estimation as many 
thousands of pixels are used to estimate a very few camera parameters; most 
fundamentally for our problem, it does not require feature extraction - images of 
natural scenes often do not have the sharp corner features required for standard 
approaches to camera calibration. 

In an earlier paper [1] we described a system that inferred the parameters of 
a high resolution triangular mesh model of a surface from multiple images of that 
surface. It proceeded by a careful modeling of the image formation process, the 
process of rendering, and showed how the rendering process could be linearized 
with respect to the parameters of the mesh (in that case, the height and albedo 
values). This linearization turned the highly nonlinear optimization for the mesh 
parameters into the tractable solution of a very high dimensional set of sparse 
linear equations. These were solved using conjugate gradient, with iterative 
linearization about the estimate from the previous iteration. 

The work in [1] required that the camera parameters (both internal and ex- 
ternal) were known, and also assumed that the lighting parameters were known. 
In this paper we continue to assume that the internal parameters are known - 
NASA mission sensors are extensively calibrated before launch - and that the 
lighting parameters are known. Here we will describe how the linearization of 
the rendering process can be performed with respect to the camera parameters, 
and hence how the external camera parameters can be estimated by minimizing 
the error between the observed and synthesized images. We assume the usual 
pinhole camera model [10]. 

To estimate the camera parameters as described above requires that a surface 
model is already available. For the initial set of images of a new region, no surface 
model is available. In principle one could optimize simultaneously over both the 
surface parameters and the camera parameters. In practice, because the camera 
parameters are correlated with all the surface parameters, the sparseness of the 
set of equations is destroyed, and the joint solution becomes computationally 
infeasible. Instead, for the initial set of images we use the standard approach of 
feature matching, and minimize the sum squared error of the distance on the 
image plane of the observed feature and the projection of the estimated feature 
in 3D. 

A surface can be inferred using the camera parameters estimated using fea- 
ture matching. New images can then be calibrated relative to this surface esti- 
mate, and used in the recursive update procedure described in [1]. 

2 Calibration by Minimizing the Whole Image Error 

Consider a surface where the geometry is modeled by a triangular mesh, and 
an albedo value is associated with each vertex of the mesh. A simulated camera 
produces an image $\hat{I}$ of the mesh. The camera is modeled as a simple pinhole 
camera, and its location and orientation are determined by six parameters: its 
location in space $(x_c, y_c, z_c)$, the intersection of the camera axis with the x-y 
plane, $(x_0, y_0)$, and the rotation of the camera about the camera axis, $\phi$. The 
last three of these parameters can be replaced by the three camera orientation 
angles, and we will use both representations in different places, depending on 
which is more convenient. These parameters are collected into a vector $\Theta$. 

Fig. 1. Geometry of the triangular facet, illumination direction and viewing direction. 
$z_s$ is the vector to the illumination source; $z_v$ is the viewing direction 

For a given surface, given lighting parameters, and known internal camera 
parameters, the image rendered by the synthetic camera is a function of $\Theta$, i.e., 
$\hat{I} = \hat{I}(\Theta)$. Making the usual assumption of independent Gaussian errors, and 
assuming a uniform prior on $\Theta$, reduces the maximum likelihood estimation 
problem to a least-squares problem 

$$\hat{\Theta} = \arg\min_{\Theta} \sum_p \left( I_p - \hat{I}_p(\Theta) \right)^2 \qquad (1)$$

where $\hat{\Theta}$ is the maximum-likelihood estimate of the camera parameters. Because 
$\hat{I}(\Theta)$ is in general a nonlinear function of $\Theta$, to make this estimation practical 
we linearize $\hat{I}(\Theta)$ about the current estimate $\Theta_0$ 

$$\hat{I}(\Theta) = \hat{I}(\Theta_0) + D x; \qquad D \equiv \left. \frac{\partial \hat{I}}{\partial \Theta} \right|_{\Theta = \Theta_0} \qquad (2)$$

where D is the matrix of derivatives evaluated at $\Theta_0$, and $x = \Theta - \Theta_0$. This 
reduces the least-squares problem in equation 1 to the minimization of a quadratic 
form, $F_2(x)$, 

$$F_2(x) = \tfrac{1}{2}\, x D D^T x^T - b x \qquad (3)$$

$$b = (I - \hat{I}(\Theta_0)) D \qquad (4)$$

which can be solved using conjugate gradient or similar approaches. In the 
following section we will describe how an object space renderer can also be made 
to compute D, the derivatives of the pixel values with respect to the camera 
parameters. 
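
The iterative linearization can be sketched with a generic least-squares loop. This is only an illustration under stated assumptions: `render` stands for any function mapping a parameter vector to a flattened synthetic image, the derivative matrix D is approximated by finite differences rather than computed by the renderer, and `numpy.linalg.lstsq` is used in place of conjugate gradient to solve each linearized problem.

```python
import numpy as np

def gauss_newton(render, observed, theta0, iters=20, eps=1e-6):
    """Repeatedly linearize render(theta) and solve for the update x."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(iters):
        synth = render(theta)
        # D[p, k]: derivative of pixel p with respect to parameter k,
        # here approximated by forward differences (an assumption).
        D = np.empty((synth.size, theta.size))
        for k in range(theta.size):
            step = np.zeros_like(theta)
            step[k] = eps
            D[:, k] = (render(theta + step) - synth) / eps
        # Minimize |observed - synth - D x|^2 over the update x.
        x, *_ = np.linalg.lstsq(D, observed - synth, rcond=None)
        theta = theta + x
    return theta
```

In the setting of the paper the residual has one entry per pixel and only six unknowns, so each linearized problem is heavily overdetermined.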



2.1 Forming the Image 

As discussed in [1], to enable a renderer to also compute derivatives it is necessary 
that all computations are done in object space. This implies that the light from 
a surface triangle, as it is projected into a pixel, contributes to the brightness of 
that pixel with a weight proportional to the fraction of the area of the triangle 
which projects into that pixel. The total brightness of the pixel is thus the sum of 
the contributions from all the triangles whose projection overlaps with the pixel 

$$\hat{I}_p = \sum_{\triangle} f^p_\triangle \Phi_\triangle \qquad (5)$$

where $f^p_\triangle$ is the fraction of the flux that falls into pixel p, and $\Phi_\triangle$ is the total 
flux from the triangle. This is given by 

$$\Phi_\triangle = \rho\, E(\alpha^s) \cos\alpha^v \cos^\kappa\theta\, \Delta\Omega, \qquad (6)$$

$$E(\alpha^s) = A \left( I^s \cos\alpha^s + I^a \right),$$

$$\Delta\Omega = S/d^2.$$

Here $\rho$ is an average albedo of the triangular facet. Orientation angles $\alpha^s$ and 
$\alpha^v$ are defined in figure 1. $E(\alpha^s)$ is the total radiation flux incident on the 
triangular facet with area A. This flux is modeled as a sum of two terms. The first 
term corresponds to direct radiation with intensity $I^s$ from the light source at 
infinity (commonly the sun). The second term corresponds to ambient light with 
intensity $I^a$. The parameter $\theta$ in equation (6) is the angle between the camera 
axis and the viewing direction (the vector from the surface to the camera); $\kappa$ is 
the lens falloff factor. $\Delta\Omega$ in (6) is the solid angle subtended by the camera which 
is determined by the area of the lens S and the distance d from the centroid of 
the triangular facet to the camera. If shadows are present on the surface the 
situation is somewhat more complex. In this paper we assume that there are no 
shadows or occlusions present. 
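
As a small worked illustration of equations 5 and 6, the sketch below evaluates the flux of one facet and the brightness of one pixel. The function and argument names are hypothetical, angles are taken in radians, and no shadows or occlusions are modeled, in line with the assumption stated above.

```python
import math

def triangle_flux(albedo, area, alpha_s, alpha_v, theta, kappa,
                  i_sun, i_ambient, lens_area, distance):
    """Total flux from one facet, following the structure of equation 6."""
    incident = area * (i_sun * math.cos(alpha_s) + i_ambient)   # E(alpha^s)
    solid_angle = lens_area / distance ** 2                     # Delta Omega = S / d^2
    return (albedo * incident * math.cos(alpha_v)
            * math.cos(theta) ** kappa * solid_angle)

def pixel_intensity(fractions, fluxes):
    """Equation 5: sum over triangles of (area fraction in the pixel) * flux."""
    return sum(f * phi for f, phi in zip(fractions, fluxes))
```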



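2.2 Computing Image Derivatives with Respect to Camera Parameters 

Taking derivatives of the pixel intensities in equation 5 gives 

$$\frac{\partial \hat{I}_p}{\partial \Theta_i} = \sum_{\triangle} \left( f^p_\triangle \frac{\partial \Phi_\triangle}{\partial \Theta_i} + \Phi_\triangle \frac{\partial f^p_\triangle}{\partial \Theta_i} \right) \qquad (7)$$

Consider first $\partial \Phi_\triangle / \partial \Theta_i$. We neglect the derivatives with respect to the 
fall-off angle, as their contribution will be small, and so it is clear from equation 6 
that the derivative with respect to any of the camera orientation angles is zero. 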
The derivative with respect to the camera position parameters is given by 

$$\frac{\partial \Phi_\triangle}{\partial \Theta_i} \propto \ldots \qquad (8)$$

where $\mathbf{v}$ is the vector from the triangle to the camera, $v = |\mathbf{v}|$, $\Theta_i$ are the three 
components of the camera position, $z_i$ are unit vectors in the three coordinate 
directions and $z_v = \mathbf{v}/v$ (see figure 1). 

Fig. 2. The intersection of the projection of a triangular surface element $(i_0, i_1, i_2)$ onto 
the pixel plane with the pixel boundaries. Bold lines correspond to the edges of the 
polygon resulting from the intersection. Dashed lines correspond to the new positions 
of the triangle edges when point $\hat{P}_{i_0}$ is displaced by $\delta\hat{P}$ 

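Now consider $\partial f^p_\triangle / \partial \Theta_i$. For triangles that fall completely within a pixel, 
the second term in equation 7 is zero, as the derivative of the area fraction 
is zero. For triangles that intersect the pixel boundary, this derivative must be 
computed. When the camera parameters change, the positions of the projections 
of the mesh vertices into the image plane will also move. The derivative of the 
fractional area is given by 

$$\frac{\partial f^p_\triangle}{\partial \Theta_i} = \frac{1}{A_\triangle} \sum_{j = i_0, i_1, i_2} \left( \frac{\partial A_{\text{polygon}}}{\partial \hat{P}_j} - f^p_\triangle \frac{\partial A_\triangle}{\partial \hat{P}_j} \right) \frac{\partial \hat{P}_j}{\partial \Theta_i} \qquad (9)$$

where $\hat{P}_j$ is the projection of point $P_j$ onto the image plane, and $A_\triangle$ is the 
area of the projection of the triangle. The point displacement derivatives will be 
detailed below. 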

Thus, the task of computing the derivative of the area fraction (9) is reduced 
to the computation of $\partial A_\triangle / \partial \hat{P}_j$ and $\partial A_{\text{polygon}} / \partial \hat{P}_j$. Note that the intersection 
of a triangle and a pixel for a rectangular pixel boundary can, in general, be a 
polygon with 3 to 7 edges with various possible forms. However, the algorithm 
for computing the polygon area derivatives that we have developed is general, 
and does not depend on a particular polygon configuration. The main idea of 
the algorithm can be described as follows. Consider, as an example, the polygon 
shown in figure 2, which is a part of the projected surface triangle with indices 
$i_0, i_1, i_2$. We are interested in the derivative of the polygon area with respect to 
the point $\hat{P}_{i_0}$ that connects two edges of the projected triangle, $(\hat{P}_{i_2}, \hat{P}_{i_0})$ and 
$(\hat{P}_{i_0}, \hat{P}_{i_1})$. These triangular edges contain segments (I, J) and (K, L) that are 
sides of the corresponding polygon. It can be seen from figure 2 that when the 
point $\hat{P}_{i_0}$ is displaced by $\delta\hat{P}_{i_0}$ the change in the polygon area is given by the 
sum of two terms 

$$\delta A_{\text{polygon}} = \delta A_{I,J} + \delta A_{K,L}$$

These terms are equal to the areas spanned by the two corresponding segments 
taken with appropriate signs. Therefore the polygon area derivative with respect 
to the triangle vertex $\hat{P}_{i_0}$ is represented as a sum of the two “segment area” 
derivatives for the two segments adjacent to a given vertex. The details of this 
computation will be given elsewhere. 

Fig. 3. Illustration of the geometry for determining the rotation between world and 
camera coordinates 

We now consider the derivatives of the point positions. 



Derivatives of the Position of the Projection of a Point on the Image 
Plane. The pinhole camera model gives 

$$\hat{u}_i = \frac{[AR(P - t)]_i}{[R(P - t)]_3}$$

where R is the rotation matrix from world to camera coordinates, t is the 
translation between camera and world coordinates and A is the matrix of camera 
internal parameters [13]. In the numerical experiments presented here we assume 
that the internal camera parameters are known, and further that the image plane 
axes are perpendicular, and that the principal point is at the origin. This reduces 
A to a diagonal matrix with elements $(k_1, k_2, 1)$, where $k_1 = -f/l_x$, $k_2 = -f/l_y$, 
where f is the focal length of the lens and $l_x$ and $l_y$ are the dimensions of the 
pixels in the retinal plane. 

The rotation matrix R can be written in terms of the Rodrigues vector [10] 
$\rho = (\rho_1, \rho_2, \rho_3)$ which defines the axis of rotation, and $\theta = |\rho|$ is the magnitude 
of the rotation. (Clearly $\rho$ can be written in terms of the camera position, the 
look-at point and the view-up vector.) 

$$R = I - H \frac{\sin\theta}{\theta} + H^2 \frac{1 - \cos\theta}{\theta^2} \qquad (11)$$

where 

$$H = \begin{pmatrix} 0 & -\rho_3 & \rho_2 \\ \rho_3 & 0 & -\rho_1 \\ -\rho_2 & \rho_1 & 0 \end{pmatrix} \qquad (12)$$

Let $\bar{H} = H/\theta$ and $n_i = \rho_i/\theta$; then 

$$\frac{\partial R}{\partial \rho_i} = -H_i \frac{\sin\theta}{\theta} + (H_i \bar{H} + \bar{H} H_i) \frac{1 - \cos\theta}{\theta} - \bar{H} \left( \cos\theta - \frac{\sin\theta}{\theta} \right) n_i + \bar{H}^2 \left( \sin\theta - 2\, \frac{1 - \cos\theta}{\theta} \right) n_i \qquad (13)$$

where $H_i = \partial H / \partial \rho_i$. Then 

$$\frac{\partial \hat{u}_i}{\partial \rho_j} = \frac{\left[ A \frac{\partial R}{\partial \rho_j} (P - t) \right]_i}{[R(P - t)]_3} - \frac{[AR(P - t)]_i \left[ \frac{\partial R}{\partial \rho_j} (P - t) \right]_3}{([R(P - t)]_3)^2} \qquad (14)$$

The derivatives with respect to the position parameters are 

$$\frac{\partial \hat{u}_i}{\partial t_j} = \frac{[AR(P - t)]_i\, [R]_{3,j}}{([R(P - t)]_3)^2} - \frac{[AR]_{i,j}}{[R(P - t)]_3} \qquad (15)$$
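
The projection and its derivative with respect to the camera position can be checked numerically with a short sketch. The function names are hypothetical; the rotation follows the sign convention of equation 11 as written here, and the analytic derivative implements equation 15 for the two image coordinates.

```python
import numpy as np

def rotation_from_rodrigues(rho):
    """R = I - H sin(theta)/theta + H^2 (1 - cos(theta))/theta^2."""
    rho = np.asarray(rho, dtype=float)
    theta = np.linalg.norm(rho)
    if theta < 1e-12:
        return np.eye(3)
    H = np.array([[0.0, -rho[2], rho[1]],
                  [rho[2], 0.0, -rho[0]],
                  [-rho[1], rho[0], 0.0]])
    return (np.eye(3) - H * np.sin(theta) / theta
            + H @ H * (1.0 - np.cos(theta)) / theta ** 2)

def project(A, R, t, P):
    """Pinhole projection: u_i = [A R (P - t)]_i / [R (P - t)]_3, i = 1, 2."""
    q = R @ (P - t)
    return (A @ q)[:2] / q[2]

def dproject_dt(A, R, t, P):
    """2 x 3 matrix of du_i/dt_j, following equation 15."""
    q = R @ (P - t)
    Aq = A @ q
    AR = A @ R
    # Entry (i, j): [AR(P-t)]_i [R]_{3,j} / q_3^2 - [AR]_{i,j} / q_3
    return np.outer(Aq[:2], R[2]) / q[2] ** 2 - AR[:2] / q[2]
```

Comparing `dproject_dt` with central finite differences of `project` is a cheap way to catch sign or index mistakes before the derivatives are used in an optimizer.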

In practice, optimization using the camera orientation angles directly is 
inadvisable, as a small change in the angle can move the surface a long distance in 
the image, and because the minimization in equation 1 is based on a sum over 
all pixels in the image, this can make for rapid changes in the cost, and failure 
to converge. Instead we use the “look-at” point $(x_0, y_0)$ which is in the natural 
length units of the problem. The conversion of the derivatives from angles to 
look-at is an application of the chain rule, and is not detailed here. 

We now consider the second problem, calibration using features detected in 
the images. 



3 Calibration by Minimizing Feature Matching Error 

It is well known that camera calibration can be performed using corresponding 
features in two or more images [4]. This estimation procedure also returns the 
3D positions of the corresponding image features. So the parameter space is 
augmented from $\Theta^f$ (where f indexes the frame, or camera parameter set) with 
$P_n$, the positions of the 3D points. 

If it is assumed that the error between $u^f_n$, the feature located in image f 
corresponding to 3D point $P_n$, and $\hat{u}^f_n$, the projection of $P_n$ into image f using 
camera parameters $\Theta^f$, is normally distributed, then the negative log-likelihood 
for estimating $\Theta$ and P is 

$$L(\Theta, P) = \sum_{f=1}^{K} \sum_{n \in \Omega_f} \sum_{l=1}^{2} \left( u^f_{l,n} - \hat{u}^f_{l,n} \right)^2 \qquad (16)$$

$$P = \{ P_n;\ n = 1 \ldots N \} \qquad (17)$$

$$\Theta = \{ \Theta^f_i;\ i = 1 \ldots 6;\ f = 1 \ldots K \} \qquad (18)$$

where $\Omega_f$ is the set of features that are detected in image f. Note that the 
features detected in a given image may well be a subset of all the $P_n$'s. l indexes 
the components of u. This form of the likelihood assumes that there is no error 
in the location of the features in the images. 

Fig. 4. Four synthetic images of an area of Death Valley 

Typically the non-linear likelihood in equation 16 is minimized using a standard 
non-linear minimization routine, for example the Levenberg-Marquardt 
algorithm [11]. The dimensionality of the parameter space in equation 16 is large, 
equal to 6 x (the number of images - 1) + 3 x (the number of 3D points) - 1, where 
the parameters of the reference camera are not included, and the overall spatial 
scale is arbitrary. In general there will be many points. Because of this large 
dimensionality, it is important to use exact derivatives, to avoid slow convergence 
and reduce the need for good initialization. In section 3.2 below we derive the 
analytic derivatives of the likelihood function, enabling this robust convergence. 

Fig. 5. “Strength” of the features found in image 0 

The other practical problem in using feature-based camera calibration is the 
detection and matching of image features. 
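
The size of the feature-based problem and its cost function can be illustrated with a few lines. The function names and the dictionary layout (keys are hypothetical `(frame, point)` pairs, so that only detected features contribute) are assumptions of this sketch; the cost is the plain sum of squared image-plane differences over detected features.

```python
import numpy as np

def num_parameters(n_images, n_points):
    """6 per non-reference camera, 3 per 3D point, minus 1 for the free scale."""
    return 6 * (n_images - 1) + 3 * n_points - 1

def neg_log_likelihood(observed, projected):
    """Sum of squared differences between observed and projected features.

    observed, projected: dicts mapping (frame, point) -> 2-vector.
    Only pairs present in `observed` (detected features) are summed.
    """
    total = 0.0
    for key, u in observed.items():
        diff = np.asarray(u, dtype=float) - np.asarray(projected[key], dtype=float)
        total += float(np.sum(diff ** 2))
    return total
```

For example, four images and fifty points already give `num_parameters(4, 50) == 167` unknowns, which is why the text stresses exact derivatives.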



3.1 Robust Feature Matching 

The maximum likelihood solution to the camera parameter estimation problem
is known to be extremely sensitive both to mismatches in the feature
correspondences, and to even small errors in the localization of the detected
features. To reliably estimate the camera parameters we need reliably located
features, reliably matched. More accurate estimation results from using a smaller
set of well localized and well matched features than from a much larger set that
includes even a single outlier. Extreme conservatism in both feature detection
and matching is needed.

The feature detector most commonly used is the Harris corner detector [2].
This feature detector was developed in the context of images of man-made
environments, which contain many strong corners. Remotely sensed images of natural
scenes of the type we are concerned with (see figure 4) contain some, but far
fewer, strong corner features. If the “feature strength” given by the Harris
detector is plotted for the features detected on the images in figure 4, it can be
seen to fall off rapidly (see figure 5). For this type of image, it is therefore
necessary to use only a small number of features, where the associated feature
strength is high enough to ensure accurate feature localization. This is a limiting
factor in the use of feature based calibration for this type of image, and a
motivation for developing the whole image approach described above. Feature
detectors more suited to natural scenes are clearly needed, but there will always
be particularly mute environments where feature based methods will fail.

Because of the extreme sensitivity to mismatches, it is necessary to ensure
that the features found are matched reliably. This is a classic chicken-and-egg
problem: to determine reliable matches it is necessary to correctly estimate the
camera parameters, and to correctly estimate the camera parameters requires
reliable matches. For this reason, feature matching has spawned a number of
methods based on robust estimation (and one approach which attempts to do
away with explicit feature matching altogether [12]).

The RANSAC algorithm [3] finds the largest set of points consistent with a
solution based on a minimal set, repeatedly generating trial minimal sets until
the consensus set is large enough. Zhang [4] also bases his algorithm on
estimation using minimal sets, but uses LMedS (Least Median of Squares) to select
the optimal minimal set. The estimate of the fundamental matrix [10] generated
using that minimal set is used to reject outliers. Zhang also applies a relaxation
scheme to disambiguate potential matches. This is a formalization of the
heuristic that features nearby in one image are likely to be close together in
another image, and in the same relative orientation. In our work we use a
modification of Zhang’s algorithm described below. Torr and co-workers have
developed MLESAC [7] and IMPSAC [6] as improvements on RANSAC. IMPSAC uses a
multiresolution framework, and propagates the probability density of the camera
parameters between levels using importance sampling. This achieves excellent
results, but was considered excessive for our application, where the types of
camera motion between frames are known in advance.

Our algorithm proceeds as follows: 

1. Use the Harris corner detector to identify features, rejecting those which 
have feature strength too low to be considered reliable. 

2. Generate potential matches using the normalized correlation score between 
windows centered at each feature point. Use a high threshold (we use t = 0.9) 
to limit the number of incorrect matches. 

3. Use LMedS to obtain a robust estimate of the fundamental matrix: 

— Generate an 8 point subsample from the set of potential matches, where 
the 8 points are selected to be widely dispersed in the image (see [4]). 

— Estimate the fundamental matrix. Zhang uses a nonlinear optimization 
based approach. We have found that the simple eight point algorithm [9] 
is suitable, because our features are in image plane coordinates, which 
correspond closely to the normalization suggested in [8]. 

— Compute the residuals, $u'^{T} F u$, for all the potential matches, and store
the median residual value.

— Repeat for a new subsample. The number of subsamples required to
ensure with a sufficiently high probability that a subset with no outliers
has been generated depends on the number of features and the estimated
probability that each potential match is an outlier.

— Identify $F_{\min}$ as the fundamental matrix which resulted in the lowest
median residual value.

4. Use $F_{\min}$ to reject outliers by eliminating matches which have residuals
greater than a threshold value. (We used a threshold of 1.0 pixels.)

5. Use the following heuristic to eliminate any remaining outliers: because the 
images are remote sensed images, the variations in heights on the surface 
are small in comparison to the distance to the camera. So points on the 
surface that are close together should move similar amounts, and in similar
directions^. The heuristic is 

— Consider all features within a radius r = 0.2 of the image size from the 
current feature. The match found for that feature is accepted if both of 
the following conditions hold: 

(a) The length of the vector between the features is less than a thresh- 
old times the length of the largest vector between features in the 
neighbourhood. 

(b) The average distance between neighbouring features in one image is 
less than a threshold times the average distance between the same 
features in the second image.

In both cases the threshold used was 1.3. 
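
The LMedS loop in step 3 can be sketched generically. Two things in the sketch below are assumptions rather than part of the text: the formula used for the number of subsamples is the standard sampling bound for minimal-set methods (the text only says what the number depends on), and the model being fitted is a toy one-dimensional location estimate standing in for the fundamental matrix, so that the example stays self-contained.

```python
import math
import random
import statistics


def num_subsamples(outlier_fraction, sample_size=8, confidence=0.99):
    """Trials needed so that, with the given confidence, at least one
    subsample of `sample_size` matches contains no outlier."""
    p_clean = (1.0 - outlier_fraction) ** sample_size
    if p_clean >= 1.0:
        return 1
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_clean))


def lmeds(data, fit, residual, sample_size, trials, rng):
    """Return the model with the lowest median squared residual over all data."""
    best_model, best_median = None, float("inf")
    for _ in range(trials):
        model = fit(rng.sample(data, sample_size))
        median = statistics.median(residual(model, d) ** 2 for d in data)
        if median < best_median:
            best_model, best_median = model, median
    return best_model, best_median


# Toy stand-in: five inliers near 5 and two gross outliers.
data = [5.0, 5.1, 4.9, 5.2, 4.8, 100.0, -50.0]
estimate, best_median = lmeds(
    data,
    fit=lambda sample: sum(sample) / len(sample),
    residual=lambda model, d: d - model,
    sample_size=2,
    trials=50,
    rng=random.Random(0),
)
trials_needed = num_subsamples(outlier_fraction=0.4)
```

In the algorithm described above, `fit` would be the eight point algorithm applied to a dispersed subsample and `residual` the epipolar residual of a potential match; the outlier rejection of step 4 would then threshold the residuals of the selected model.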

Features are matched between all pairs of images in the set, and are used in the
likelihood minimization (16) to estimate the camera parameters. Note that not
all features will be detected in all the images in a set, so the likelihood will only
contain terms for the features actually found in each image.



3.2 Computing Derivatives of the Feature Positions 

To effectively minimize $L(\Theta, P)$ in equation 16 we need to compute its
derivatives, which reduces to computing $\partial \hat{u}^f_n / \partial \Theta^f$ and
$\partial \hat{u}^f_n / \partial P_n$. In what follows we will concentrate on one frame
and drop the $f$ index.

We have already shown in equation 14 the expression for the derivative of the
point position with respect to the rotation angles and camera position, which
together make up $\partial \hat{u}_n / \partial \Theta$. It remains only to give the
expression for the derivative with respect to the 3D feature point. This is the
same as the derivative with respect to the camera position, see equation 15, but
with the sign reversed, giving

$$\frac{\partial \hat{u}_{i,n}}{\partial P_{n,j}} = -\frac{[AR(P_n - t)]_i \, [R]_{3,j}}{\left([R(P_n - t)]_3\right)^2} + \frac{[AR]_{i,j}}{[R(P_n - t)]_3} \qquad (19)$$

where the subscript $n$ indexes the features.
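
A derivative of this kind is easy to check numerically. The sketch below assumes the projection model $\hat{u}_i = [AR(P - t)]_i / [R(P - t)]_3$, which is inferred from the terms that appear in the derivative and is not stated in this section; the camera matrix, rotation, and point values are hypothetical. It compares the analytic Jacobian with respect to the 3D point against central finite differences.

```python
import math


def matmul(X, Y):
    return [[sum(X[r][k] * Y[k][c] for k in range(len(Y))) for c in range(len(Y[0]))]
            for r in range(len(X))]


def matvec(M, v):
    return [sum(row[c] * v[c] for c in range(len(v))) for row in M]


def project(A, R, t, P):
    """Assumed model: u_i = [A R (P - t)]_i / [R (P - t)]_3."""
    d = [P[k] - t[k] for k in range(3)]
    num = matvec(matmul(A, R), d)
    depth = matvec(R, d)[2]
    return [num[0] / depth, num[1] / depth]


def analytic_jacobian(A, R, t, P):
    """d u_i / d P_j under the assumed projection model (quotient rule)."""
    d = [P[k] - t[k] for k in range(3)]
    AR = matmul(A, R)
    num = matvec(AR, d)
    depth = matvec(R, d)[2]
    return [[AR[i][j] / depth - num[i] * R[2][j] / depth ** 2 for j in range(3)]
            for i in range(2)]


def numeric_jacobian(A, R, t, P, eps=1e-6):
    J = [[0.0] * 3 for _ in range(2)]
    for j in range(3):
        plus, minus = list(P), list(P)
        plus[j] += eps
        minus[j] -= eps
        up, um = project(A, R, t, plus), project(A, R, t, minus)
        for i in range(2):
            J[i][j] = (up[i] - um[i]) / (2 * eps)
    return J


# Hypothetical camera: 2x3 internal matrix, small rotation about the x axis.
c, s = math.cos(0.1), math.sin(0.1)
A = [[800.0, 0.0, 320.0], [0.0, 800.0, 240.0]]
R = [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]
t = [0.0, 0.0, -10.0]
P = [1.0, 2.0, 3.0]

Ja = analytic_jacobian(A, R, t, P)
Jn = numeric_jacobian(A, R, t, P)
max_err = max(abs(Ja[i][j] - Jn[i][j]) for i in range(2) for j in range(3))
```

Because the projection depends on the point and the camera position only through their difference, the Jacobian with respect to the camera position is the negative of the one computed here.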

4 Results and Conclusions 

Figure 4 shows four synthetic images of Death Valley. The images were generated 
by rendering a surface model from four different viewpoints and with different 
lighting parameters. The surface model was generated by using the USGS Digital 
Elevation Model of the area for the heights, and using scaled intensities of a 
LANDSAT image as surrogate albedos. The size of the surface was approximately 

^ This is not true if the camera moves towards the surface, but even then, if the 
movement towards the surface is not excessive, this heuristic still approximately 
holds. 







Robin D. Morris, Vadim N. Smelyanskiy, and Peter C. Cheeseman 




Fig. 6. Features found in image 0 



350 × 350 points, and the distance between grid points was taken as 1 unit. The
images look extremely realistic. 

Table 1 shows the results of estimating the camera parameters using features 
detected in the images. These estimates are good, but far from exact. Considering 
the images in figure 4, it is clear that there are few strong corner features. Figure 
6 shows the set of strong features detected in image 0 (the top left image in figure 
4). Two things are apparent. Firstly, the features are all due to rapid changes in 
albedo. Secondly, with two exceptions, the features are clustered. This clustering 
reduces the accuracy of the estimation. That the features are mostly albedo 
features confirms that feature based approaches are not applicable to many of 
the types of image we are interested in. 

Table 1 also shows the results for calibration using matching to a pre-existing
3D surface model. The estimation was initialized at the results of the
point-matching estimation. The minimization of $(I - \hat{I}(\Theta))^2$ was performed
iteratively, re-rendering to compute a new $\hat{I}$ at the value of $\Theta$ at the
convergence of the previous minimization. As expected, the estimates are very
significantly better than the results from point matching, and are very accurate.
However, these results are predicated on the existence of a surface model.

These results suggest an approach to camera calibration that is the subject 
of our current, ongoing, research. Point matching can be used to estimate initial 
camera parameters, and a very sparse surface representation. A dense surface 
(shape and albedo) can then be inferred using these camera parameters (see 
[1]). The whole-image matching approach can be used to re-estimate the camera 
parameters, and a new surface estimate can be made using the new camera pa- 
rameter estimates. This process can be iterated. The convergence of this iterative 
procedure is currently being studied. 
Table 1. Results for camera parameter inference. Image 0 was the reference image
with parameters camera - (−300, 1416, 4000), look at - (205, 1416, 0) and view up -
(0, 1, 0)

                   true               point-match estimate     whole-image estimate
image 1  camera    (700, 1416, 4000)  (610, 1410, 4030)        (685, 1409, 4001)
         look at   (205, 1416, 0)     (205, 1420, 0)           (205, 1416, 0)
         view up   (0, 1, 0)          (-0.005, 1, 0.002)       (0, 1.0, 0.002)
image 2  camera    (200, 900, 4000)   (200, 968, 4050)         (203, 894, 3996)
         look at   (205, 1416, 0)     (206, 1410, 0)           (205, 1416, 0)
         view up   (0, 1, 0)          (-0.015, 0.994, 0.11)    (0, 0.993, 0.129)
image 3  camera    (200, 1900, 4000)  (176, 1780, 4030)        (196, 1881, 4001)
         look at   (205, 1416, 0)     (206, 1420, 0)           (205, 1416, 0)
         view up   (0, 1, 0)          (-0.007, 0.996, -0.090)  (0, 0.993, -0.116)
Abstract. To segregate overlapping objects into depth layers requires
the integration of local occlusion cues distributed over the entire image
into a global percept. We propose to model this process using a
hierarchical Markov random field (HMRF), and suggest a broader view that
clique potentials in MRF models can be used to encode any local decision
rules. A topology-dependent multiscale hierarchy is used to introduce
long range interaction. The operations within each level are identical
across the hierarchy. The clique parameters that encode the relative
importance of these decision rules are estimated using an optimization
technique called learning from rehearsals based on 2-object training samples.
We find that this model generalizes successfully to 5-object test images,
and that depth segregation can be completed within two traversals across
the hierarchy. This computational framework therefore provides an
interesting platform for us to investigate the interaction of local decision
rules and global representations, as well as to reason about the rationales
underlying some recent psychological and neurophysiological findings
related to figure-ground segregation.



1 Introduction 

Figure-ground organization is a central problem in perception and cognition. It 
consists of two major processes: (1) depth segregation - the segmentation and 
ordering of surfaces in depth and assignment of border ownerships to relatively 
more proximal objects in a scene [15,26,27]; (2) figural selection - the extraction 
and selection of a figure among a number of ‘distractors’ in the scene. Evidence of
both of these processes has been found in the early visual cortex [17,19,20,36].

In computer vision, figure-ground segregation is closely related to image seg- 
mentation and has been studied from both contour processing and region pro- 
cessing perspectives. Contour approaches perform contour completion based on 
good curve continuation [11,12,24,32,33], whereas region approaches perform im- 
age partitioning based on surface properties [28,30,37,39]. 

Here, we focus on the issue of global depth segregation based on sparse
occlusion cues arising from closed boundaries. The importance of local occlusion
cues in determining global depth perception can be appreciated in our
remarkable ability to infer relative depths among objects in cartoon drawings
(Fig. 1a). These sparse occlusion cues provide important constraints for the emergent
global perception of figure and ground. The formation of global percepts from
such local cues and the computation of layer organizations have been modeled
as an optimization process with a surface diffusion mechanism [8,9,22].

In this paper, we extend these earlier works [8,9,22] by embedding explicit 
decision rules for contour continuation and surface depth propagation in local 
units of a Hierarchical Markov random field model. The multiscale hierarchy is 
sensitive to the topology of image structures and is used to facilitate rapid long 
range propagation of local cues. We also develop a parameter learning method 
using linear programming to estimate the parameters that encode the relative 
importance of those decision rules. Results show that parameters learned on a 
few two-object training samples can generalize successfully to multiple-object 
images. 

The rest of the paper is organized as follows. Section 2 describes the problem 
and expands our method in detail. Section 3 shows our results on a new test 
image. Section 4 concludes the paper with a discussion. 



2 Methods 

2.1 Problem Formulation 

For simplicity, we take an edge map (Fig. 1b) with complete and closed contours
of rectangular shapes as input to our system. These shapes can overlap and
occlude one another. The occluded part of an object is not visible. The system
is to produce two complementary maps as output (Fig. 1f): a pixel depth map
(Fig. 1d) where a higher depth value is assigned to pixel depth units of a more
proximal surface and a lower value to pixel units of a more distant surface;
and an edge depth map in which the edge depth units at the border of a more
proximal surface assume a higher value. The edge depth units assume the same
depth value as the pixel depth units of the surface to which they belong (Fig. 1e).
These two representations are sufficient to specify the depth ordering sequence
of objects in the scene.

In general, it is not possible to recover the exact depth ordering or overlap
sequence in the scene since the solution is not unique. For example, there can
be multiple choices when objects do not occlude each other directly (objects 1
and 2 in Fig. 1b) and when we cannot tell which object is occluding which
(objects 3 and 4 in Fig. 1b). If we represent visible pairwise object occlusion
relationships in a directed graph (Fig. 1c), these two cases correspond to the
existence of unconnected siblings of the same parent. Instead of recovering the
overlap sequence, we can sort object depths into layers, ordered by occlusion.
This problem is called the 2.1D sketch in [28]. If there is a directed cycle in the
graph, then the depth cannot be segregated into layers. We define the depth
assignment solution to be the set of smallest depth labels that satisfy all the







Stella X. Yu, Tai Sing Lee, and Takeo Kanade 




d. Pixel depth label, e. Edge depth label, f. Goal of our model. 



Fig. 1. Segregate depth into layers. Rectangular objects are numbered in b. Darker 
object surfaces/edges are in front of lighter object surfaces/edges in d/e. Given an 
edge map as input, our model produces two complementary depth maps as output. 



visible occlusion relationships. For example, object 4 in Fig. 1c is on layer 1
rather than layer 2.
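
The definition of the depth assignment as the smallest labels consistent with the occlusion graph can be made concrete with a short sketch. It assumes a hypothetical representation in which `occludes[a]` lists the objects that `a` is directly in front of, with the background on layer 0; the object names are illustrative and do not reproduce the figure.

```python
def depth_layers(occludes):
    """Smallest depth labels satisfying every visible occlusion relationship.

    occludes[a] lists the objects that a is directly in front of.
    An object that occludes nothing sits on layer 1 (background is layer 0).
    """
    layers = {}
    in_progress = set()

    def layer(obj):
        if obj in layers:
            return layers[obj]
        if obj in in_progress:
            raise ValueError("occlusion cycle: depth cannot be segregated into layers")
        in_progress.add(obj)
        behind = occludes.get(obj, [])
        layers[obj] = 1 + max((layer(b) for b in behind), default=0)
        in_progress.discard(obj)
        return layers[obj]

    objects = set(occludes) | {b for bs in occludes.values() for b in bs}
    for obj in objects:
        layer(obj)
    return layers


# Hypothetical scene: A is in front of B, B is in front of C, D overlaps nothing.
result = depth_layers({"A": ["B"], "B": ["C"], "D": []})
```

An object with no visible occlusion relationship, such as `D` here, receives the smallest admissible label rather than an arbitrary one, which is the convention the text adopts.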

2.2 MRF Model 

Segregating depth into layers is a global process which requires the information 
to be integrated over the entire image. A change of configuration in a small area 
can influence the depth labeling at a distance. On the other hand, there exist 
critical local cues such as T and L junctions which give rise to 3D percepts. If 
each of these cues can be clearly classified and labeled, and there is a unique
association between these cues and 3D depth, depth labeling can be solved by
logical inference, for example, using the occlusion graph in Fig. 1c. However,
there is always uncertainty in identifying local cues in real images and there is 
no universal rule of association between a low level cue and a high level percept. 
The ambiguity in this association is reduced with an increase in the range of in- 
tegration. For example, two L-junctions can be configured to form a T-junction 
which is not related to occlusion. The meaning of this T-junction can be disam- 
biguated by gathering information from the origins of the arms and stem of the 
T-junction. 

Long range influence can be mediated by local computation using MRF
[10,21]. An MRF is defined over a graph $\mathcal{G}$, which is determined by its site
set $S$ and neighborhood system $\eta$. $S = Z_m \cup Z'_m$, where $Z_m$ is an $m \times m$ pixel
lattice and $Z'_m$ is its dual lattice consisting of an $m \times (m-1)$ and an $(m-1) \times m$
interleaved grids for line sites [10]. The coupled neighborhood of a site includes
both its peer sites and dual sites, as illustrated in Fig. 2.

a. $\eta$(pixel site) b. $\eta$(horizontal line site) c. $\eta$(vertical line site)

Fig. 2. The neighborhood system $\eta$ used in the model.

Given an edge map, $g : Z'_m \mapsto \{0, 1\}$, with 1 and 0 indicating the
presence and absence of an edge respectively, we would like to find a depth map
on both pixel and line sites, $h : Z_m \cup Z'_m \mapsto \{0, 1, 2, \ldots\}$, with
the depth layers numbered from 0 (for background). To model the depth
segregation by MRF, we need to specify clique potentials $V_c(\omega)$, $\omega$ being a particular
configuration of an MRF, and $c$ being a clique defined as a subset of sites, which
consists of either a single site or more sites where any two of them are neighbors.
The probability $P(\omega)$ can be written as

$$P(\omega) = \frac{1}{Z} e^{-U(\omega)}, \qquad U(\omega) = \sum_{c} V_c(\omega),$$

where $Z$ is called the partition function and $U(\omega)$ the energy function.
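
To make the roles of the energy and the partition function concrete, the following sketch enumerates every configuration of a tiny three-site chain. It assumes the usual Gibbs form, in which the probability is proportional to the exponential of the negative energy, and it uses a hypothetical pairwise potential that rewards equal neighbours; neither the chain nor the potential is taken from the model in the text.

```python
import itertools
import math


def energy(config, beta=1.0):
    """Sum of pairwise clique potentials on a chain: -beta if neighbours agree, +beta otherwise."""
    return sum(-beta if a == b else beta for a, b in zip(config, config[1:]))


configs = list(itertools.product([0, 1], repeat=3))
Z = sum(math.exp(-energy(c)) for c in configs)           # partition function
prob = {c: math.exp(-energy(c)) / Z for c in configs}    # Gibbs probabilities
most_probable = max(prob, key=prob.get)
lowest_energy = min(energy(c) for c in configs)
```

Enumeration is only feasible because the chain has eight configurations; for an image-sized lattice the sum defining the partition function is intractable, which is why the most probable configuration is sought by minimizing the energy instead.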

MRFs have been widely used in texture modeling [5], as well as in image
segmentation [10,13]. In texture modeling, the clique potentials are used to model
the probability of co-occurrence of subsets of pixels [5] or capture marginal
probability distributions in terms of filter responses [38]. In image segmentation,
MRFs are closely related to the energy functional approaches [4,10,25] and the
clique potentials are used to encode smoothness priors [10]. In our formulation
below, we generalize the idea of multi-level logistic models, and suggest a broader
view that clique potentials can be more general so that they can encode arbitrary
local decision rules.



2.3 Encoding Local Decision Rules 



To model the depth segregation process in an MRF, we seek to make correct depth
labeling correspond to the most probable configurations or, equivalently,
configurations of minimum energy.

Let $\chi$ and $\gamma$ denote two indicator functions, which map from {True, False} to
{1, 0} and {−1, 1} respectively, $\gamma(\cdot) = 1 - 2\chi(\cdot)$. Let $\zeta$ denote the sign function,
which takes on −1, 0, 1 for negative, zero and positive numbers respectively. The
line site $a$ between pixels $i$ and $j$ is denoted by $a = i \circ j$ and conversely, the set
of pixels associated with the line is denoted by $a^{\circ} = \{i, j\}$, with $i$ and $j$ ordered
from left to right or from top to bottom. In particular, $(i,j) \circ (i,j+1)$ and
$(i,j) \circ (i+1,j)$ are abbreviated as $(i, j_{\circ})$ and $(i_{\circ}, j)$ respectively. Using these
symbols and notations, we can define $V_c(h|g)$ to encode our prior knowledge in
terms of 10 local rules.
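$$
\begin{aligned}
V_c(h|g) ={}& \sum_{a = i \circ j \in c} \beta_1 \cdot \gamma(h_i = h_j) \cdot \chi(g_a = 0) && \text{(rule 1)} \\
&+ \sum_{a = i \circ j \in c} \beta_2 \cdot \gamma(h_i \neq h_j) \cdot \chi(g_a = 1) && \text{(rule 2)} \\
&+ \sum_{a = i \circ j \in c} \beta_3 \cdot \gamma(h_a = \max(h_i, h_j)) \cdot \chi(g_a = 1) && \text{(rule 3)} \\
&+ \sum_{(a = i \circ j,\, b = k \circ l) \in c^{l}} \beta_4 \cdot \gamma(h_a = h_b) \cdot \chi(g_a = g_b = 1) && \text{(rule 4)} \\
&+ \sum_{(a = i \circ j,\, b = k \circ l) \in c^{l}} \beta_5 \cdot \gamma(\zeta(h_i - h_j) = \zeta(h_k - h_l)) \\
&\qquad \cdot \chi(h_i \neq h_j,\ h_k \neq h_l) \cdot \chi(g_a = g_b = 1) && \text{(rule 5)} \\
&+ \sum_{(a = i \circ k,\, b = j \circ k) \in c^{c}} \beta_6 \cdot \gamma(h_a = h_b) \cdot \chi(g_a = g_b = 1) && \text{(rule 6)} \\
&+ \sum_{(a = i \circ k,\, b = j \circ k) \in c^{c}} \beta_7 \cdot \gamma(\zeta(h_i - h_k) = \zeta(h_j - h_k)) \\
&\qquad \cdot \chi(h_a = h_b) \cdot \chi(g_a = g_b = 1) && \text{(rule 7)} \\
&+ \sum_{(a = i \circ j,\, b = k \circ l,\, u = j \circ l,\, v = i \circ k) \in c^{+}} \beta_8 \cdot (\gamma(h_a > h_u) + \gamma(h_b > h_u)) \\
&\qquad \cdot \chi(\zeta(h_i - h_j) = 1 \cup \zeta(h_k - h_l) = 1) \cdot \chi(g_a = g_b = g_u = 1 \cap g_v = 0) && \text{(rule 8)} \\
&+ \sum_{(a = i \circ j,\, b = k \circ l,\, u = j \circ l,\, v = i \circ k) \in c^{+}} \beta_9 \cdot (\gamma(h_i > h_j) + \gamma(h_k > h_l)) \\
&\qquad \cdot \chi(\zeta(h_i - h_j) = 1 \cup \zeta(h_k - h_l) = 1) \cdot \chi(g_a = g_b = g_u = 1 \cap g_v = 0) && \text{(rule 9)} \\
&+ \sum_{(a = i \circ j,\, b = k \circ l,\, u = j \circ l,\, v = i \circ k) \in c^{+}} \beta_{10} \cdot \gamma(h_i = h_l) \\
&\qquad \cdot \chi(g_a = g_u = 0 \cup g_b = g_v = 0) && \text{(rule 10)}
\end{aligned}
$$

where $c^{l}$, $c^{c}$, $c^{+}$ are the sets of cliques for aligned lines, corners and crosses:

$$
\begin{aligned}
c^{l} &= \{(a,b) : a = (i_{\circ}, j),\ b = (i_{\circ}, j+1);\ a = (i, j_{\circ}),\ b = (i+1, j_{\circ}),\ a, b \in c\}, \\
c^{c} &= \{(a,b) : a = (i_{\circ}, j),\ b = (k, l_{\circ}),\ |i - k| \le 1,\ |j - l| \le 1,\ a, b \in c\}, \\
c^{+} &= \{(a,b,u,v) : (a,b) \in c^{l},\ (u,v) \in c^{l},\ \{a,b\} \cap \{u,v\} = \emptyset,\ a^{\circ} \cup b^{\circ} = u^{\circ} \cup v^{\circ}\}.
\end{aligned}
$$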

The two indicator functions, $\chi$ and $\gamma$, enable us to embed the conjunction of
if conditionals into the clique potentials. Let us decode rule 1 as an example.
Consider the line site $a$ between pixels $i$ and $j$. If the clause $(g_a = 0)$ is not
true, i.e. there is an edge between the two pixels, then this first term is zero and
no action will be taken; otherwise, if the clause $(h_i = h_j)$ is also true, i.e. the
pixel depth values at the two sites are equal, then the term produces a reward of
$-\beta_1$, lowering the energy. However, if it is not true, i.e. the depth values at the
two pixel sites are different, then $V_c(h|g)$ gets $\beta_1$ on this term as a punishment,
increasing the energy. Here we require all $\beta$s to be positive. These 10 rules are
summarized in Table 1 and they can be classified into 6 groups as follows.

Group 1: Depth continuity within surface. Rules 1 and 10 assert that surface 
depth units in adjacent locations should be continuous. Adjacency is defined 
on two kinds of neighborhood. Rule 1 is concerned with the first order neigh- 
borhood (up, down, left and right neighbors), and rule 10 is concerned with 
the second order neighborhood (diagonally adjacent pixels). 

Group 2: Depth discontinuity across edges. Rule 2 asserts that when there is an 
edge between two adjacent locations, the surface depth units in those two 
locations must have different depth values. 
Table 1. Encoding rules in clique potentials. Each of these $\beta$ terms encodes a logic
rule, which in general reads like this: if current clique configuration does not satisfy
condition A, it gets a score of 0; otherwise, if condition A is satisfied, pattern B is
expected; if B is also satisfied, then it gets a negative score −C; otherwise it gets a
positive score C. a, b, u and v are labels for line sites while i, j, k, l are labels for pixel
sites in the cliques.

 #   Configuration                        Condition A                                    Pattern B                               Score C            Meaning
 1   pixel pair, $a = i \circ j$          $g_a = 0$                                      $h_i = h_j$                             $\beta_1$          Depth continues in surface.
 2   pixel pair, $a = i \circ j$          $g_a = 1$                                      $h_i \neq h_j$                          $\beta_2$          Depth breaks at edges.
 3   pixel pair, $a = i \circ j$          $g_a = 1$                                      $h_a = \max(h_i, h_j)$                  $\beta_3$          Edges belong to surface in front.
 4   aligned lines, $a = i \circ j$,      $g_a = g_b = 1$                                $h_a = h_b$                             $\beta_4$          Depth continues along contour.
     $b = k \circ l$
 5   aligned lines, $a = i \circ j$,      $g_a = g_b = 1$,                               $\zeta(h_i - h_j) = \zeta(h_k - h_l)$   $\beta_5$          Depth polarity continues
     $b = k \circ l$                      $h_i \neq h_j$, $h_k \neq h_l$                                                                            along contour.
 6   corner, $a = i \circ k$,             $g_a = g_b = 1$                                $h_a = h_b$                             $\beta_6$          Depth continues around corners.
     $b = j \circ k$
 7   corner, $a = i \circ k$,             $g_a = g_b = 1$,                               $\zeta(h_i - h_k) = \zeta(h_j - h_k)$   $\beta_7$          Depth polarity continues
     $b = j \circ k$                      $h_a = h_b$                                                                                               around corners.
 8   cross, $a = i \circ j$,              $g_a = g_b = 1$, $g_u = 1$, $g_v = 0$,         $h_a > h_u$; $h_b > h_u$                $\beta_8$ each     Depth breaks on edges
     $b = k \circ l$, $u = j \circ l$,    $\zeta(h_i - h_j) = 1$ or                                                                                 at T-junctions.
     $v = i \circ k$                      $\zeta(h_k - h_l) = 1$
 9   cross (as in rule 8)                 as in rule 8                                   $h_i > h_j$; $h_k > h_l$                $\beta_9$ each     Depth breaks in surface
                                                                                                                                                    at T-junctions.
10   cross (as in rule 8)                 $g_a = g_u = 0$ or $g_b = g_v = 0$             $h_i = h_l$                             $\beta_{10}$       Depth continues in surface.

Group 3: Border-ownerships. Rule 3 specifies that an edge depth unit shares 
the same depth value as the surface that owns it. 

Group 4: Depth continuity along contour. Rules 4 and 6 specify the edge depth 
value along contour or corners should be continuous. 

Group 5: Depth polarity continuity along contour. Rules 5 and 7 specify the 
depth polarity of surface units across an edge unit should be continuous 
along contour and corners. 

Group 6: Occlusion relationships at T-junctions. At those T-junctions, rules 8
and 9 specify that the arms of the T are in front of the T stem.
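
All of the rule groups share the same reward and penalty pattern: a condition gates the term, and a pattern decides whether the term lowers or raises the energy. As a minimal sketch, the two indicator functions and the first three rules can be written as follows. The function names, argument order, and the value of the weight are illustrative assumptions, not the authors' implementation.

```python
def chi(condition):
    """Indicator mapping True -> 1, False -> 0."""
    return 1 if condition else 0


def gamma(condition):
    """Indicator mapping True -> -1 (reward), False -> +1 (penalty)."""
    return 1 - 2 * chi(condition)


def rule1(beta1, h_i, h_j, g_a):
    """No edge between i and j: equal depths are rewarded, unequal depths penalized."""
    return beta1 * gamma(h_i == h_j) * chi(g_a == 0)


def rule2(beta2, h_i, h_j, g_a):
    """Edge between i and j: different depths are rewarded."""
    return beta2 * gamma(h_i != h_j) * chi(g_a == 1)


def rule3(beta3, h_a, h_i, h_j, g_a):
    """Edge between i and j: the edge takes the depth of the surface in front."""
    return beta3 * gamma(h_a == max(h_i, h_j)) * chi(g_a == 1)


# Hypothetical weight; the text only requires the weights to be positive.
beta = 2.0
reward = rule1(beta, h_i=1, h_j=1, g_a=0)      # condition holds, pattern holds
penalty = rule1(beta, h_i=1, h_j=2, g_a=0)     # condition holds, pattern fails
inactive = rule1(beta, h_i=1, h_j=2, g_a=1)    # condition fails, term is zero
```

Writing each rule as a product of a weight, a sign, and a gate is what makes the energy linear in the weights, a property the parameter estimation relies on.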

In this formulation, the clique potentials no longer simply specify local
co-occurrence, smoothness constraints or filter response histograms as in other MRF
models, but are generalized to encode a set of local decision rules. From a neural
modeling perspective, the units in the network are not neurons with linearly
weighted inputs and sigmoidal activation functions, but are capable of
performing complicated logical computations individually. Recent findings and
models in cellular neurophysiology [1,18,23] suggest neurons are capable of
computations more sophisticated than previously assumed.

The relative importance of the weights $\beta$ in the depth segregation can be
estimated using a variety of methods. We will describe a particular supervised
learning method we use in a later section.
2.4 Multiscale Hierarchy

The MRF model described above suffers from being myopic [14] in local com- 
putation and sluggish at propagating constraints between widely separated pro- 
cessing elements [31]. This problem can be overcome by embedding the MRF in 
a hierarchy using multigrid techniques. 

We build an edge map pyramid by down-sampling with a factor of 2 (Fig. 
3). Assuming m = 2^ + 1, we preserve spatial locations at the center and 
the boundary of the lattices throughout the levels of the hierarchy. Let r]^ 
and denote the neighborhood and lattice at level 1. Let " and ” address 
the correspondence between pixels at level I and I + 1, such that i G 
and i G or, i G Z!^ and i G Z!^^ point to the same spatial location on 
the sampling grid (Figure 3c). The edge map at a high level is determined by 

sioj +9ioj) ■ IC(^i ^ + ^ ^)|). 
Fig. 3. Hierarchical edge maps and illustration of sampling. a. Edge map at some level
$l$. b. Edge map at a higher level $l + 1$. c. The hierarchy is built by downsampling pixels
(filled circles) by a factor of two and inferring new lines (darker lines) between sampled
pixels based on the depth and edge maps at a lower level.



If an edge is considered to disconnect two neighboring pixels, the above oper- 
ation preserves connectivity when there is only one edge separating two sampled 
pixels. However, when there are two edges in a local neighborhood, the depth 
polarity of the edges has to be considered (Fig. 4). When two nearby edges have 
the same polarity, they can be merged into one edge of the same polarity as in 
(Fig. 4a). When the two edges have opposite depth polarities as in (Fig. 4b), 
they would disappear at the next level of the hierarchy. In this way, relaxation at 
each resolution deals with topologically equivalent diffusion processes and thus 
the same procedure can be applied. 

The intergrid transfer functions involve restriction $\Uparrow$ and extension $\Downarrow$.

During the restriction, smoothing is carried out on connected pixel sites. For
line sites, the smoothing on aligned horizontal (vertical) edges is blocked by
vertical (horizontal) edge neighbors:

h\ = max = 0, fc G n | 
Fig. 4. The operation in the multiscale hierarchy takes edge depth polarity into
consideration. a. Edges of overlapping shapes have the same depth polarity and are
preserved at a coarser resolution. b. Edges of abutting shapes that have opposite depth
polarities will disappear at a coarser resolution, as indicated by the disappearance of
the two edges between the two shapes at the coarser scale. In this way, relaxation at
each resolution deals with topologically equivalent diffusion processes and thus the
same procedure can be applied. In both a and b, the pictures on the left and right
indicate the images at a fine and coarse resolution respectively.



h\~ ~ , = max • 
9(p,q)o(i,q)






= 0,p e [i - l,i -f l],g 






can be defined in a similar fashion. Median filtering can also be used
in the above. During the extension, the information is selectively transferred to
a fine grid. The dual operation of smoothing is diffusion, which is subject to
boundary blockage:
Finally, to complete our HMRF model, we provide site visitation and a
multilevel interaction scheme. A complete sweep of all the sites includes four
checkerboard update schemes, first on pixel sites and then on line sites. The
separate visitation to pixel sites and line sites allows each of the two MRFs to
develop fully in itself so that the resultant configuration provides enough driving
force for the other to change accordingly. The hierarchy is visited bottom-up through
restriction and then top-down through extension. The MRF at each level carries
out a relaxation process until its configuration converges. When the
configuration at the lowest level does not change after visiting the entire hierarchy,
that configuration is the final result.

In summary, the multiscale hierarchy not only helps to speed up computation, but
also helps propagate sparse depth cues at boundaries to the interior of the surface
by longer range interactions at higher levels of the hierarchy. In addition, at each
level of the hierarchy, we repeat the same relaxation operation of local decision
rules. This relies on the consistency of topology in the restriction and extension
operations.



2.5 Parameter Estimation 



The above HMRF model has unknown parameter $\beta = [\beta_1, \ldots, \beta_{10}]^T$. The major
difficulty in estimating MRF parameters lies in the evaluation of the partition
function. There are several approaches to deal with the problem [21]. One way is
to avoid the partition function in the formula, such as pseudo-likelihood [3] and
least squares (LS) fit [5]. Another way is to use some estimation techniques such
as the coding method, mean field approximation [35] and Markov Chain Monte
Carlo maximum likelihood [6]. The approach we take here is to derive a set of
constraints on $\beta$ using a method called learning from rehearsals and use linear
programming to obtain the $\beta$ that satisfy these constraints.

This perturbation-based method is most closely related to the LS fit approach 
[5]. Let Uk{uj) denote the sum of clique potentials Vc{ta) over all cliques containing 
site k. Since Vc(lu) is a linear function of /3, so is C/fc(cu). In general, it can be 
written as Uk{uj) = x{ui,k) ■ (3, where x{uj,k) can be obtained by evaluating 
clique potentials on the configuration uj confined to the neighborhood of k. In 
the LS approach, the probabilities of training samples are utilized to derive a 
set of equalities based on the formula below. 



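$$\ln \frac{P(\omega_k = i \mid \omega_{\eta_k})}{P(\omega_k = j \mid \omega_{\eta_k})} = -[U_k(\omega) - U_k(\omega')] = -[x(\omega, k) - x(\omega', k)] \cdot \beta,$$

where $\omega_k = i$, $\omega'_k = j$, and $\omega_{S \setminus \{k\}} = \omega'_{S \setminus \{k\}}$ is given. However, this is only applicable
to the case where $P(\omega_k = j \mid \omega_{\eta_k}) > 0$. This condition may not be very restrictive
in texture modeling, but it is in our model because when $\omega_{\eta_k}$ is set, $\omega_k$ is often
determined as well. Another problem concerns numerical stability. When $P(\omega_k = j \mid \omega_{\eta_k})$
is small, the estimation is not accurate. To relax this condition, we derive
inequality constraints on $\beta$ instead: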



$$[x(\omega, k) - x(\omega', k)] \cdot \beta < 0, \quad \text{if } P(\omega_k = i \mid \omega_{\eta_k}) > P(\omega_k = j \mid \omega_{\eta_k}).$$



We do not need to know the exact sizes of the two probabilities, but rather the
relative order of the two quantities. In other words, for a given neighborhood
configuration $\omega_{\eta_k}$, if we know label $i$ is preferred to label $j$ for site $k$, we obtain a
constraint which ensures that site $k$ assuming the value $i$ leads to a lower energy.

We obtain two sets of constraints on $\beta$ in the form of the above inequalities. We
generate a set of images which have two randomly positioned rectangular shapes.
Both the edge map $g$ and the final depth map $h$ are known for each training
image. The first set of constraints comes from the fact that, given that the neighbors of
a site assume correct labels, this site prefers its own correct label. This will
map the correct labeling into a local minimum in the configuration space. We
summarize all such constraints into $A \cdot \beta < 0$, where the rows of $A$ come from
the perturbation on the teacher map $h$ at all sites:



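$$[x(\omega, k) - x(\omega', k)] \cdot \beta < 0, \quad \text{for } P(\omega_k \mid \omega_{\eta_k}) > P(\omega'_k \mid \omega_{\eta_k}),$$

where $\omega_k = h_k$, $\omega'_k = h_k \pm 1$, and $\omega_{S \setminus \{k\}} = h_{S \setminus \{k\}}$. An example on an
L-junction is given in Fig. 5. As can be seen in the example, the first set of
constraints is usually satisfiable, as the correct label is far better than any
other choice according to the rules we encode in the energy function.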



o   o   o
o | •   •
o | •   •

a. Configuration.

$$\begin{pmatrix} -4 & -4 & -4 & 0 & -2 & 0 & -2 & 0 & 0 & -2 \\ -4 & 0 & -4 & 0 & 0 & 0 & 0 & 0 & 0 & -2 \end{pmatrix} \cdot \beta < 0$$

b. Constraints.



Fig. 5. Deriving the first set of constraints from teacher depth maps. a. An L-junction
at a pixel site’s neighborhood. The teacher depth map in this neighborhood is 0 for
unfilled circles and 1 for all the line sites and filled circles. b. Two constraints obtained
by perturbation on the depth value of the center pixel site. The first constraint comes
from the difference in the energy functions for labeling 1 and 0 at the center pixel. The
second constraint comes from the difference in the energy functions for labeling 1 and
2 at the center pixel, all its neighbors assuming correct labels. These two constraints
on $\beta$ are trivial as any $\beta > 0$ is feasible.
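
The claim in the caption can be checked numerically. The following is a minimal sketch, not the authors' code: the two rows are copied from panel b of Fig. 5, while the helper names and the all-ones parameter vector are assumptions made purely for illustration.

```python
# Constraint rows copied from Fig. 5b; each row plays the role of
# x(w, k) - x(w', k) for one perturbation of the center pixel.
ROWS = [
    [-4, -4, -4, 0, -2, 0, -2, 0, 0, -2],
    [-4, 0, -4, 0, 0, 0, 0, 0, 0, -2],
]


def dot(row, beta):
    """Plain dot product of a constraint row with the parameter vector."""
    return sum(a * b for a, b in zip(row, beta))


def satisfied(row, beta):
    """True when the preferred label has strictly lower energy: row . beta < 0."""
    return dot(row, beta) < 0


# Hypothetical positive parameter vector (not the learned values).
beta_example = [1.0] * 10
results = [satisfied(row, beta_example) for row in ROWS]
```

Because every coefficient in both rows is non-positive and at least one is negative, any strictly positive parameter vector satisfies both inequalities.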



The first set of constraints only guarantees local behavior when the system
is close to the optimal configuration. They may not be enough to drive
an initial configuration toward that final optimal configuration. A second set of
constraints is derived for this purpose. This is not easy because there are many
possible different paths of evolution from one configuration into another, and
we do not necessarily know the intermediate configurations that the system has
to go through in order to arrive at the final state. We develop a method called
learning from rehearsals to overcome this difficulty. Not knowing $\beta$ in advance
or teacher depth maps at intermediate steps, we use the following principle to
choose a preferred label during the learning process and to establish its validity
by rehearsing. The principle is that a site’s depth value should be as close as
possible to its final target value at that site subject to the dragging force from
its current neighborhood configuration. That is, the derivation of the second
set of constraints is based on finding the most effective intermediate states that
will move the system from the initial state to the final state with a minimum
number of steps. Once a preferred depth label is chosen, we can derive plausible
constraints in a similar way as we did in Fig. 5. We build a constraint database
during learning. Whenever a new constraint is to be added into the database,
we check its own feasibility as well as its compatibility with those already in the
database. We implement two simple checks on these two properties by testing
whether the new constraint $a \cdot \beta < 0$ leads to $\beta < 0$, or whether some other constraint requiring
$-a \cdot \beta < 0$ already exists in the database. If either of these conditions is true,
the constraint is removed; accordingly, the hypothesized teacher is abandoned
and the next candidate depth value, which is not as close to the target value
as this one, is chosen. When new constraints can be checked into the database,
the intermediate teacher is instantiated. We make the depth assignment at the
site and continue the learning process as if all the conditions were satisfied. We
call this process rehearsal because we carry out the relaxation without knowing
whether there is a feasible set of $\beta$. We summarize the second set of constraints
in $\bar{A} \cdot \beta < 0$.
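
The two database checks described above can be sketched as follows. This is an illustrative reading rather than the authors' implementation: it assumes all parameters are positive, so that a constraint whose coefficients are all non-negative could only hold for a negative parameter, and it treats a stored constraint with exactly negated coefficients as the conflicting one. The function names are hypothetical.

```python
def is_infeasible_alone(a):
    """Assumed form of the first check: with all parameters positive,
    a . beta < 0 cannot hold if every coefficient of a is non-negative."""
    return all(c >= 0 for c in a)


def conflicts(a, database):
    """Second check: the database already holds the constraint -a . beta < 0."""
    negated = tuple(-c for c in a)
    return negated in database


def try_add(a, database):
    """Store constraint a . beta < 0 if it passes both checks.

    Returns True when the constraint was added, False when it was rejected
    (in which case the hypothesized teacher would be abandoned)."""
    a = tuple(a)
    if is_infeasible_alone(a) or conflicts(a, database):
        return False
    database.add(a)
    return True


# Usage: the database is a set of coefficient tuples.
db = set()
added_first = try_add([1, -3, 0], db)      # accepted
added_opposite = try_add([-1, 3, 0], db)   # rejected: contradicts the first
added_positive = try_add([4, 0, 2], db)    # rejected: needs a negative parameter
```

A rejected constraint leaves the database unchanged, which mirrors the text: the next candidate depth value is then tried instead.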

The system will rehearse and practice, like a baby learning to walk, trying to
reach the final goal from an initial state, while generating constraints on its gaits
at each step along the way. Having obtained these two sets of constraints on $\beta$,
we can proceed to find the $\beta$ that satisfies most constraints by solving
the following linear programming problem,

$$\text{LP}: \quad \text{minimize: } \sum_i \delta_i + \lambda \sum_j \bar{\delta}_j$$

$$\text{subject to: } A \cdot \beta - \delta \le -1, \quad \bar{A} \cdot \beta - \bar{\delta} \le -1, \quad \delta \ge 0, \quad \bar{\delta} \ge 0, \quad \beta \ge 1,$$

where $\lambda \ge 1$ is a weighting factor between the two sets of constraints; here
we simply set it to 1. Since not every constraint can be satisfied, we introduce
slack variables $\delta$ and $\bar{\delta}$ to turn them into soft constraints. Linear programming
is used to find the $\beta$ that minimizes the total amount of violation of the
constraints.
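
For a fixed parameter vector, the smallest slack that satisfies a soft constraint of this kind is the positive part of the row's dot product plus one, so the quantity being minimized can be evaluated directly. A minimal sketch under that reading follows; the function names, the argument `lam` for the weighting factor, and the example matrices are assumptions, and an actual LP solver would additionally search over the parameters.

```python
def slacks(A, beta):
    """Smallest non-negative slack per row so that row . beta - slack <= -1."""
    return [max(0.0, sum(a * b for a, b in zip(row, beta)) + 1.0) for row in A]


def total_violation(A_first, A_second, beta, lam=1.0):
    """Weighted sum of slacks over the two constraint sets for a given beta."""
    return sum(slacks(A_first, beta)) + lam * sum(slacks(A_second, beta))


# Hypothetical toy data: one constraint per set, two parameters.
A_first = [[-4, -2]]    # satisfied with margin, slack 0
A_second = [[1, -1]]    # violated, needs slack 1
beta = [1, 1]
violation = total_violation(A_first, A_second, beta)
```

A total of zero means every constraint holds with the required margin, which corresponds to the situation in which all slack variables vanish.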

Once LP yields a $\beta$, we examine the constraints’ slack variables to see
which constraint is most severely violated (the largest positive $\delta$ or $\bar{\delta}$). We find
that a bad constraint is typically generated by making a hasty jump before the
condition is mature, putting an unnecessarily harsh constraint on $\beta$. We go back
to the constraint database, remove this constraint, and choose alternative
teachers for all the patterns that give rise to this constraint. This prevents that
constraint from being selected again in subsequent rehearsals. We remove enough bad
constraints until a feasible $\beta$ is found. We test its validity by relaxation using
this $\beta$ to see if it can actually drive the system from the initial state to the
final state for each training example. The learning and checking processes are
iterated until final configurations for all the training images are correct. The
learning proceeds from simple to complex images, to gradually build up a set of
reasonable constraints. Most of the time when a new image is learned, only a couple
of iterations are sufficient to obtain a new $\beta$ such that all $\delta$ and $\bar{\delta}$ are 0.

3 Results 

Learning on a small set of training images containing two objects singles out a
unique value for $\beta$, where $\beta$ = [18,9,97,23.3,3.2,86.7,3.35,16.5,42.5,137,20.8].
With this set of parameters, the model produces reasonable results for a set of
test images that the system has never been exposed to before.
Figure 6 shows how the system responds to a test image with five overlapping
rectangles in the scene. The system generalizes very well in its response
to this new input configuration. A sequence of 8 snapshots is taken at different
time points during the evolution of the system. Snapshot 1 shows the
system detecting T-junctions and starting to propagate its initial result one level
up the hierarchy. Snapshot 2 shows that the information has propagated to the third
level, and propagation of depth information within surfaces is now evident at the
second level. Snapshots 3 and 4 show that the information has propagated to the
fourth and fifth levels respectively. Snapshot 5 shows the information starting to
propagate down the hierarchy, introducing rapid filling-in of surface depth and
depth segregation in snapshot 6. Snapshots 7 and 8 show the completion of
surface/contour depth interpolation and segregation. All of this is completed
very rapidly in two iterations up and down the hierarchy.

4 Discussion 

In this paper, we present a hierarchical MRF model to perform depth segrega- 
tion of region edge maps. The model is hierarchical rather than simply multiscale 
because its fine-to-coarse transform is topology-dependent. In this work, we pro- 
pose a broader view that clique potentials in MRF can be used to encode any 
local logical decision rules. By introducing a set of rules that asserts continuity 
of depth assignment values along contour and within surfaces, and discontinuity 
of depth assignment values across contours, we demonstrate a system that automatically
integrates sparse local relative depth cues arising from T-junctions
over long distances into a global ordering of relative depths. Interestingly, because
the rules we set are encoding relative relationships between objects, the system 
trained on scenes containing two objects can actually generalize and perform 
correctly when a scene containing five objects is first encountered. 

We also propose a new method called learning from rehearsals for estimat- 
ing MRF parameters. In this method, we derive a set of constraints based on 
perturbation of target solutions and the rehearsals of relaxation processes, and 
then use linear programming to obtain feasible solutions. Conflicting constraints 
are removed and constraint derivation by rehearsals and parameter solving are 
repeated until there is a set of parameters that work correctly for every test 
image. We do not have a theoretical proof that the learning of this system will 
actually converge. We have restricted our domain of investigation to a world of 
simple shapes so that we can gain a better understanding of the system and 
associate constraints with their origins. 

Another assumption we made is that the input edge maps are closed contours.
There is no technical difficulty here insofar as there exist a number
of algorithms such as active contours [16] and region competition algorithms 
[37,39] that can produce complete and closed contours. However, depth segrega- 
tion and ordering can potentially help segmentation by feeding back additional 
constraints to organize the contour detection and completion process itself. Ear- 
lier work by Belhumeur [2] and recent work by Yu and Shi [34] are examples 



Fig. 6. Dynamics of the HMRF’s response to a 5-object test image. The parameters
are learned on a few 2-object images. Shown here are a number of snapshots taken
at different time points during the depth segregation computation. The hierarchy is
traversed twice until it converges completely to the correct labeling.



of how depth cues and intensity cues can be integrated simultaneously into the 
segmentation process. These are potential directions for future research. 

We think this HMRF model for depth segregation might provide a plausi- 
ble computational framework for reasoning about and understanding the basic 
computational constraints and neural mechanisms underlying local and global 
integration and figure-ground segregation in the brain. This work provides us
with several insights into some psychological and neurophysiological phenomena.

First, brightness has been observed to propagate in from the border in the
psychophysical experiment by Paradiso and Nakayama [29]. Such a phenomenon
has been postulated to be mediated by horizontal connections in V1, for example
in Grossberg and Mingolla’s model [11]. Here, we show that a hierarchical
framework can speed up the diffusion of the depth assignment process considerably.
In fact, traversing up and down the hierarchy twice is sufficient to complete the
computation. This suggests that both the brightness perception and the depth
segregation could be mediated by the feedback from V2 and V4, which are known
to have receptive fields two and four times larger than those of V1 respectively.

Second, while Paradiso and Nakayama’s experiment suggests diffusion in the 
brightness domain, the similarity in dynamics between brightness diffusion and 
our depth assignment suggests depth segregation and assignment might be the 
underlying process that carries the brightness diffusion along. By the same reasoning,
one would expect other surface cues such as color, texture and stereo
disparity to accompany, if not follow, the depth assignment
process. It will indeed be interesting to examine experimentally whether the
propagation of surface cues follows the depth assignment process or occurs
simultaneously. That Dobbins et al. [7] found a significant number of V1, V2 and
V4 cells sensitive to distance even in monocular viewing conditions suggests that
depth assignment might be intertwined with many early visual processes.

Finally, the hierarchy presented is not simply a multiscale network in that, 
when the information travels up, the topological relationships between different 
objects are taken into consideration in such a way that the same relaxation 
procedure can be applied at each level. For example, edges of overlapping shapes 
are kept (Fig. 4a), whereas the edges of two nearby shapes appearing side by side
would disappear at a coarser resolution (Fig. 4b). This operation can be achieved
by taking the sum of depth polarities during the down-sampling process. In order
to accomplish this in the network, depth polarity of edges needs to be computed
and represented explicitly. This might provide a computational rationale for the
existence of the depth-polarity sensitive cells von der Heydt and his colleagues
found in V1, V2 and V4 [36].

Abstract. Within a human motion analysis system, body parts are
modeled by simple virtual 3D rigid objects. Their position and orientation
parameters at frame $t + 1$ are estimated based on the parameters at
frame $t$ and the image intensity variation from frame $t$ to $t + 1$, under
kinematic constraints. A genetic algorithm calculates the 3D parameters
that minimize a goal function measuring the intensity change.
The goal function is robust, so that outliers, located especially near the
virtual object projection borders, have less effect on the estimation. Since
the object’s parameters are relative to the reference system, they are the
same from different cameras, so more cameras are easily added, increasing
the constraints over the same number of variables. Several successful
experiments are presented for an arm motion and a leg motion from two
and three cameras.

Keywords: Human motion, robust estimation, twist. 



1 Introduction 

The literature on human motion analysis (e.g., [WADP97,HHD98,WP00]) has
reported important progress. However, the problem of tracking human
motion is still unsolved. We present in this paper work in progress, rather
than completed research, on a combination and modification of different approaches
used before, and some experimental evidence that our more general approach is
promising.

Bregler and Malik [BM97] built a system able to track human motion with 
great precision, including very challenging footage like Muybridge’s photograph 
sequences. They defined a 3D virtual model of the subject and a goal func- 
tion over body part position parameters that measures the changes in image 
intensities. Using the twist representation for rigid transformations and the flow 
constraint equation, they managed to make the goal function linear in the parameter
variables. They could therefore apply an iteration of linear optimization
techniques and a warping routine to obtain a very reliable procedure for 3D
position estimation. Our system is a modification of this one. The differences
are the following: First, our goal function is robust, so that no EM procedure is
needed afterwards; instead, our optimization directly performs a robust parameter
estimation. Second, we do not use the flow constraint equation but the direct
difference of pixel intensities; thus fewer assumptions, such as small motion and
constant intensity, are made and no warping procedure is needed. Due to
these two differences, we cannot apply a linear optimization technique,
as we will explain below. Third, our parameters are reference-based instead of
camera-based, so additional cameras do not increase the number of variables to
be estimated. And fourth, we handle arbitrary rotations in the articulations, so
that a fixed axis for each articulation does not have to be defined by the user,
and arbitrary motion can be modeled. Our system is not a finished product and
therefore its performance cannot be compared to Bregler and Malik’s, but we
will try to convey why we think this project is promising.

The Cardboard People system [JBY96] performs robust estimation of motion 
parameters for 2D regions. Assuming the flow constraint equation and a model 
for the motion of each patch, motion parameters are robustly estimated using 
a non-linear optimization procedure. This system relies on a good estimation of 
the flow constraint equation coefficients. We can say that our system is roughly 
a 3D version of Cardboard People. 

Wachter and Nagel [WN97] describe a system in which motion parameters are
estimated using a Kalman filter. An optimization procedure finds the parameters
that minimize the difference between the predicted parameters and the ones
that agree with the edges of the next frame. Their results are very
impressive. Our system would benefit from a stochastic prediction.

Hunter, Kelly and Jain [HKJ97] present a system for motion tracking in which 
each step enforces the kinematic constraints by projecting the estimated solution 
based on the observation processes. The system works well using only black 
silhouettes because the modified EM algorithm proposed is a robust formulation 
that makes the recovered motion less sensitive to constraint violations. 

All these systems, and ours as well, assume that a virtual humanoid that
matches the real subject in size and initial position can be defined by other
means; in practice, by user interaction. This is a research subject in itself, and
will not be discussed any further.

2 Problem Formulation 

Given the film $I_t(x, y)$, we consider two consecutive frames at times $t$ and $t + 1$.

The virtual model of a body part is an ellipsoid of appropriate dimensions.
Virtual counterparts of the real cameras are defined by a camera calibration routine.
Assume that the 3D pose (position and orientation) of the ellipsoid at time $t$
is known. The problem is to find the change in the 3D pose of the ellipsoid
so that its motion coincides with the real image motion. Let $\phi$ be the pose
transformation of the ellipsoid from $t$ to $t + 1$; $\phi$ is defined by 6 real parameters
that will be discussed in detail later.

Let $(x, y)$ be a pixel, and $(u_x(x, y, \phi), u_y(x, y, \phi))$ its displacement vector
when a 3D point that is projected onto the camera pixel moves according to $\phi$.
The goal functional $E'$ is the brightness change sum over the point projections
before and after the pose transformation.

Fig. 1. $\rho(x, 0.4 \times 256)$ and $\rho(x, 1.0 \times 256)$.



We consider the functional $E'$:

$$E'(\phi) = \sum_{(x,y) \in \mathcal{R}} \rho\big(I_t(x, y) - I_{t+1}(x + u_x(\phi), y + u_y(\phi))\big) = \sum_{(x,y) \in \mathcal{R}} \rho(\Delta I) \qquad (1)$$

where $\phi \in \mathbb{R}^6$ are the 3-D motion parameters, $I_t(x, y)$ is the image brightness
(intensity) function for the initial frame, $I_{t+1}(x, y)$ is the image brightness function
for the final frame, $(u_x(\phi), u_y(\phi))$ are the horizontal and vertical components
of the flow image at the point $(x, y)$, which is the projection from $\mathbb{R}^3$ of the motion
associated with the parameters $\phi$; $\mathcal{R}$ is the patch to consider, in this case,
the projection of the virtual ellipsoid that models the tracked part; and, finally,



$$\rho(t) = \frac{t^2}{\sigma^2 + t^2}$$



is the function that reduces the influence of some outlying measurements of the 
brightness difference and allows an estimation of the dominant parameters; there 
are other p-functions that can be considered as well to obtain a different robust 
estimation. In the rest of this paper, we will use the one above. See Figure 1 for 
a representation. 
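
As a hedged illustration, the sketch below assumes the robust function has the Geman-McClure-type form $t^2 / (\sigma^2 + t^2)$, which saturates at 1 for large residuals so that outliers contribute a bounded amount. The function name and the sample values are illustrative only; the two scales are the ones named in the caption of Fig. 1.

```python
def rho(t, sigma):
    """Assumed robust function t^2 / (sigma^2 + t^2); requires sigma > 0.

    Small residuals behave roughly quadratically, large ones saturate
    below 1, which limits the influence of outlying measurements."""
    return (t * t) / (sigma * sigma + t * t)


# The two scales named in the caption of Fig. 1.
sigma_narrow = 0.4 * 256
sigma_wide = 1.0 * 256

# Sample residuals spanning one 8-bit intensity range.
samples = [0, 32, 64, 128, 255]
narrow_curve = [rho(t, sigma_narrow) for t in samples]
wide_curve = [rho(t, sigma_wide) for t in samples]
```

With a smaller scale the function saturates earlier, so the same brightness difference is treated as closer to an outlier.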

The term $E'(\phi)$ measures how well all pixels of frame $t$ match their corresponding
ones in frame $t + 1$, and its minimum is used to calculate the new pose
of the ellipsoid. For the analysis from frame $t + 1$ to $t + 2$, this new pose is used
to calculate the pose at $t + 2$. If for any reason (partial occlusion, not finding
the optimum, etc.) the new pose at $t + 1$ is not totally correct, the error will
persist in the analysis of the rest of the sequence. In order to allow the system
to recover from errors during some frames, we add another term that compares
the intensity average of each new region with the original region in frame 0.

Assume that $\bar{I}_0$ is the average at frame 0. We define the functional

$$E''(\phi) = \sum_{(x,y) \in \mathcal{R}} \rho\big(\bar{I}_0 - I_{t+1}(x + u_x(\phi), y + u_y(\phi))\big). \qquad (2)$$

Notice that each pixel is not compared with another single pixel, but that
the average is used.

Fig. 2. Displacement vector.



The objective is to minimize $E = E' + E''$ with respect to $\phi$, for which a precise
definition of $(u_x, u_y)$ and its gradient are needed.
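
A minimal sketch of how the sum of the two terms could be accumulated over a patch is given below. It is not the authors' implementation: frames are assumed to be lists of rows of gray values, the displacement is rounded to the nearest pixel, pixels displaced outside the image are skipped, and the robust function is passed in as a parameter. All names are hypothetical.

```python
def energy(frame_t, frame_t1, patch, displacement, mean0, rho):
    """Accumulate E' + E'' over the pixels (x, y) listed in patch.

    frame_t, frame_t1: images indexed as frame[y][x]
    displacement:      callable (x, y) -> (ux, uy) for the candidate pose change
    mean0:             average intensity of the region in frame 0
    rho:               robust function of a single brightness difference
    """
    height, width = len(frame_t1), len(frame_t1[0])
    e_match = 0.0    # E': pixel-to-pixel brightness change
    e_anchor = 0.0   # E'': comparison against the frame-0 average
    for (x, y) in patch:
        ux, uy = displacement(x, y)
        xn, yn = int(round(x + ux)), int(round(y + uy))
        if not (0 <= xn < width and 0 <= yn < height):
            continue  # assumption: ignore pixels that leave the image
        moved = frame_t1[yn][xn]
        e_match += rho(frame_t[y][x] - moved)
        e_anchor += rho(mean0 - moved)
    return e_match + e_anchor
```

In this toy setting, a displacement that follows a bright pixel to its new location yields zero energy, while leaving it in place does not.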



3 Motion Projection 



The object pose relative to the reference frame can be represented as a rigid
body transformation in $\mathbb{R}^3$ using homogeneous coordinates:

$$q_r = \begin{pmatrix} r_{11} & r_{12} & r_{13} & d_x \\ r_{21} & r_{22} & r_{23} & d_y \\ r_{31} & r_{32} & r_{33} & d_z \\ 0 & 0 & 0 & 1 \end{pmatrix} \cdot q_o,$$



where $q_o = (x_o, y_o, z_o, 1)^T$ is a point in the object frame and $q_r = (x_r, y_r, z_r, 1)^T$
is the corresponding point in the reference frame. $q_c = (x_c, y_c, z_c, 1)^T$ is the
corresponding point in the camera frame: $q_r = M_c \cdot q_c$, where $M_c$ is the transformation
matrix associated with the camera frame. See Figure 2.

Using orthographic projection with scale $s$, the point $q_r$ in the reference frame
gets projected onto the image point $(x_{im}, y_{im})^T = s \cdot (x_c, y_c)^T$. The scale $s$ is equal to the
focal distance divided by the distance of the ellipsoid center to the camera, which
happens to be a good approximation for all the points on the ellipsoid.
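
It can be shown ([MLS94]) that for any arbitrary $G \in SE(3)$, there exists a
vector $\xi = (v_1, v_2, v_3, w_x, w_y, w_z)^T$, called the twist representation, with associated
matrix

$$\hat{\xi} = \begin{pmatrix} 0 & -w_z & w_y & v_1 \\ w_z & 0 & -w_x & v_2 \\ -w_y & w_x & 0 & v_3 \\ 0 & 0 & 0 & 0 \end{pmatrix}$$

such that $G = e^{\hat{\xi}} = \mathrm{Id} + \hat{\xi} + \frac{\hat{\xi}^2}{2!} + \cdots$.

We define the pose of an object as $\xi = (v_1, v_2, v_3, w_x, w_y, w_z)^T$. A point $q_o$ in
the object frame is projected onto the image location $(x_{im}, y_{im})$ with:

$$\begin{pmatrix} x_{im} \\ y_{im} \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix} \cdot s \cdot M_c^{-1} \cdot e^{\hat{\xi}} \cdot q_o. \qquad (3)$$

The image motion of the point from time $t$ to time $t + 1$ is:

$$\begin{pmatrix} u_x \\ u_y \end{pmatrix} = \begin{pmatrix} x_{im}(t+1) - x_{im}(t) \\ y_{im}(t+1) - y_{im}(t) \end{pmatrix}.$$

By using (3) we can write the previous expression as:

$$\begin{pmatrix} u_x \\ u_y \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix} \cdot s \cdot M_c^{-1} \cdot \big((1 + s') \cdot e^{\hat{\xi}'} - \mathrm{Id}\big) \cdot q_r,$$

with $\xi' = (v'_1, v'_2, v'_3, w'_x, w'_y, w'_z)^T$ such that $e^{\hat{\xi}(t+1)} = e^{\hat{\xi}'} \cdot e^{\hat{\xi}(t)}$ and $s' = \frac{s(t+1)}{s(t)} - 1$.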

We assume that the scale change due to the motion is negligible from frame 
to frame, since the objects are far from the camera. Therefore s' = 0. 

Assuming that the motion is small, i.e., the ellipsoid center moves a few
centimeters and the axis orientation changes a few degrees, we have $\|\xi'\| \ll 1$.
We approximate the matrix $e^{\hat{\xi}'}$ by $\mathrm{Id} + \hat{\xi}'$. Experimentally, we confirm that the
approximation is very good.
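
The quality of the first-order approximation can be illustrated numerically. The sketch below is an assumption-laden illustration rather than part of the described system: it uses the standard 4 x 4 matrix form of a twist, evaluates the matrix exponential by a truncated Taylor series, and compares it with the identity plus the twist matrix for a hypothetical small motion.

```python
def hat(xi):
    """Standard 4x4 matrix associated with a twist (v1, v2, v3, wx, wy, wz)."""
    v1, v2, v3, wx, wy, wz = xi
    return [[0.0, -wz, wy, v1],
            [wz, 0.0, -wx, v2],
            [-wy, wx, 0.0, v3],
            [0.0, 0.0, 0.0, 0.0]]


def identity():
    return [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]


def matmul(a, b):
    return [[sum(a[r][k] * b[k][c] for k in range(4)) for c in range(4)]
            for r in range(4)]


def expm_series(m, terms=20):
    """Matrix exponential via the truncated Taylor series sum of m^n / n!."""
    result = identity()
    term = identity()
    for n in range(1, terms):
        term = [[value / n for value in row] for row in matmul(term, m)]
        result = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(result, term)]
    return result


def first_order(m):
    """First-order approximation Id + m."""
    eye = identity()
    return [[eye[r][c] + m[r][c] for c in range(4)] for r in range(4)]


def max_abs_diff(a, b):
    return max(abs(a[r][c] - b[r][c]) for r in range(4) for c in range(4))


# Hypothetical small motion: about 2 cm of translation, a few degrees of rotation.
xi_small = (0.02, 0.0, 0.01, 0.0, 0.03, 0.05)
m = hat(xi_small)
err = max_abs_diff(expm_series(m), first_order(m))
```

The largest entry-wise error is dominated by the second-order term, so it shrinks quadratically as the motion gets smaller.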

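We rewrite the previous expression as:

$$\begin{pmatrix} u_x \\ u_y \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix} \cdot s \cdot M_c^{-1} \cdot \hat{\xi}' \cdot M_c \cdot q_c. \qquad (4)$$

The value of $\phi$ is $(v'_1, v'_2, v'_3, w'_x, w'_y, w'_z)^T$, which are the optimization variables
of the problem.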

Based on (4), for a pixel $(x, y)$, we need only calculate $q_c$ to describe the image
motion in terms of the motion parameters $\phi$. The 3D point $q_c$ is calculated by
intersecting with the ellipsoid a ray orthogonal to the camera image pixel. To do
so, we associate with each pixel at $t_0$ the corresponding $z_c$ of the closest point
on the ellipsoid surface that is projected onto that pixel.

The parameters $\phi$ are independent of the camera. Hence, when we add more
cameras, no new variables are needed but more constraints are added.

4 Kinematic Chain

The parameterization of a single body part $P_1$ has been discussed in the previous
section. Assume that a second body part $P_2$ is attached to the first one at a point
and that $E_{P_1}(\phi_1) + E_{P_2}(\phi_2)$ is the new functional to be minimized, which includes
both parts. The optimization has to take into account the fact that the parts
share a point when they move, by means of a kinematic constraint.

Let $p_1$ and $p_2$ be the coordinates of the shared point in the two object frames.
Let $\xi_1, \xi'_1, \xi_2$ and $\xi'_2$ be the twist and its change for each part as in the previous
section. In order to keep the parts attached, the following equality must be true:

$$e^{\hat{\xi}'_1} e^{\hat{\xi}_1} \cdot p_1 = e^{\hat{\xi}'_2} e^{\hat{\xi}_2} \cdot p_2.$$

If the point was shared at frame $t$, it is true that

$$e^{\hat{\xi}_1} \cdot p_1 = e^{\hat{\xi}_2} \cdot p_2 = p_r,$$

where $p_r$ are the coordinates of the joint in the reference system. Hence, the
constraint simplifies to

$$e^{\hat{\xi}'_1} \cdot p_r = e^{\hat{\xi}'_2} \cdot p_r$$

and using again the first order approximation of the exponential function, it
leads to

$$C(\phi_1, \phi_2) = \hat{\xi}'_1 \cdot p_r - \hat{\xi}'_2 \cdot p_r = 0.$$

The linear equations $C = 0$ above can be used to express three variables in
terms of the others, and reduce the number of variables by half. This approach
may have the problem that it constrains the relative motions to be perfect rotations
which, in practice, they are not. When $n$ parts are used, the number of variables is
$3n + 3$; the last 3 comes from the fact that there is one part (the trunk) that
does not depend on any other part, and uses 6 degrees of freedom. This is the
approach taken, because the gains of reducing the number of variables are more
important than its drawbacks.
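
One way to read the elimination of three variables is sketched below, under the assumption that a twist change acts on a point to first order as the cross product of its rotational part with the point plus its translational part. Setting the two first-order displacements of the joint equal then fixes the translational part of the second twist. The function names are hypothetical and the numbers are arbitrary.

```python
def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def joint_displacement(v, w, p):
    """First-order displacement of point p under a twist change (v, w): w x p + v."""
    return tuple(c + t for c, t in zip(cross(w, p), v))


def dependent_translation(v1, w1, w2, p_r):
    """Solve the linear constraint for v2: v2 = v1 + (w1 - w2) x p_r."""
    dw = tuple(a - b for a, b in zip(w1, w2))
    return tuple(a + b for a, b in zip(v1, cross(dw, p_r)))


# Arbitrary illustrative values: parent twist change, child rotation, joint position.
v1, w1 = (0.1, 0.0, 0.0), (0.0, 0.0, 0.2)
w2 = (0.05, 0.0, 0.0)
p_r = (1.0, 2.0, 3.0)

v2 = dependent_translation(v1, w1, w2, p_r)
d1 = joint_displacement(v1, w1, p_r)
d2 = joint_displacement(v2, w2, p_r)
```

With the translation of the attached part determined this way, only its three rotational parameters remain free, which is consistent with the count of $3n + 3$ variables.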

5 Multiple Cameras 

The use of several cameras reduces the motion ambiguity that comes from having
only one camera. The motion parameter $\phi$ for each part does not depend on
a camera, so there is no need to introduce new variables. For each part $P$ and
camera $c$, a term $E_{P,c}$ is added to the functional to be minimized. Therefore,
cameras can be easily added or dropped without changing the number of variables
or the optimization procedure. For each camera, its associated matrix $M$
must be supplied from calibration.

In fact, at least two cameras are needed if the virtual ellipsoids in the original 
frame have to be easily positioned by the user without taking real measurements 
in the recording setting. The user could point to the extreme points of the 
ellipsoids in two views and the system can easily get their 3D position. 

6 The Use of Color

Color also significantly reduces the ambiguity during tracking, especially in the
absence of textured surfaces.

In the HSL representation, the hue represents an angle and, therefore, it has
a discontinuity between $2\pi$ and 0. Following [ROK00], we change coordinates
and define:

$$h = S \cos H, \quad s = S \sin H, \quad l = L.$$

The intensity value $I$ used in the previous sections can be one of the three color
channels, which we represent as $I_h$, $I_s$ and $I_l$. It is well known [PK94] that for
tracking purposes the $L$ component is less stable than the others on the same
object due to illumination changes; therefore, we give double weight to the other
components. In Equation 1, $\rho(\Delta I)$ is replaced by

$$0.4\,\rho(\Delta I_h) + 0.4\,\rho(\Delta I_s) + 0.2\,\rho(\Delta I_l).$$

A similar replacement is done in Equation 2. 
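
The change of coordinates and the channel weighting can be sketched as follows. This is an illustration only: the hue is assumed to be given in radians, the robust function is passed in as a parameter, and the function names are hypothetical. The weights 0.4, 0.4 and 0.2 are the ones stated above.

```python
import math


def hsl_to_cartesian(H, S, L):
    """Map (H, S, L) to (S cos H, S sin H, L); H is assumed to be in radians."""
    return (S * math.cos(H), S * math.sin(H), L)


def color_rho(c0, c1, rho):
    """Weighted robust difference between two colors in the new coordinates."""
    dh, ds, dl = (a - b for a, b in zip(c0, c1))
    return 0.4 * rho(dh) + 0.4 * rho(ds) + 0.2 * rho(dl)


# Two hues on either side of the 0 / 2*pi discontinuity stay close together.
a = hsl_to_cartesian(0.01, 1.0, 0.5)
b = hsl_to_cartesian(2 * math.pi - 0.01, 1.0, 0.5)

# A gray pixel (zero saturation) keeps only its lightness component.
gray = hsl_to_cartesian(0.0, 0.0, 0.7)
```

Because zero saturation maps the first two coordinates to zero, the same expression covers black and white input without a special case.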

When the input images are black and white, H=S=0 and the only useful
channel is L. Thus our system handles both color and black and white cameras,
and even a mixture of them.

7 Multi-scale Analysis 

A multi-scale analysis, as in [WHA93], was implemented. Frame images and
virtual projections are zoomed out with a Gaussian filter. Motion estimation at a
rough scale is used as an initial solution (seed) of the optimization procedure at a
finer scale. In our approach, the use of a rougher scale has an efficiency advantage:
since images are smaller, more iterations and more possibilities (offspring) can be
considered in the search procedure; at a finer scale, the solution should improve
according to the improved image precision.

8 Optimization 

To minimize the functional $E$, which includes a term for each part, camera and
color channel, we use a genetic algorithm described in [Mic96] that works well
on non-linear optimization problems. The representation used for the possible
solutions is simply a vector of dimension equal to the number of variables (3 for
each part). The fitness function is $E$.

The system starts with a number of initial values (seeds) for $\phi$, for each part,
that include some randomly calculated values and an initial value that is the
solution of the previous frame (0 for the first frame) for the rougher scale, and
the solution of the scale above for finer scales.

The population of values is replaced by 25%, and there are 10 offspring
for each of the above. The operators used are whole and simple arithmetical
crossover, uniform mutation, boundary mutation, non-uniform mutation, whole
non-uniform mutation, heuristic crossover, Gaussian mutation and pool recombination.
The experiments use between 500 and 1000 iterations. There is, of
course, no guarantee that a global optimum is found.
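
A deliberately simplified genetic search in this spirit can be sketched as follows. It is not the algorithm of [Mic96]: only two of the listed operators (whole arithmetical crossover and Gaussian mutation) are shown, and the function name, mutation scale, generation count, bound handling and survivor selection are assumptions. The 25% replacement rate is taken from the text.

```python
import random


def genetic_minimize(f, seeds, bounds, generations=200, replace_frac=0.25,
                     mutation_sigma=0.05, rng=None):
    """Minimize f over real vectors, starting from at least two seed vectors.

    Each generation keeps the best individuals and replaces the worst
    replace_frac of the population with mutated crossover children."""
    rng = rng or random.Random(0)
    lo, hi = bounds
    pop = [list(seed) for seed in seeds]
    n_replace = max(1, int(len(pop) * replace_frac))
    for _ in range(generations):
        pop.sort(key=f)                            # lowest energy first
        parents = pop[:len(pop) - n_replace]       # survivors, best included
        children = []
        for _ in range(n_replace):
            a, b = rng.choice(parents), rng.choice(parents)
            t = rng.random()
            # whole arithmetical crossover: convex combination of two parents
            child = [t * x + (1.0 - t) * y for x, y in zip(a, b)]
            # Gaussian mutation, clipped to the allowed range
            child = [min(hi, max(lo, c + rng.gauss(0.0, mutation_sigma)))
                     for c in child]
            children.append(child)
        pop = parents + children
    return min(pop, key=f)
```

Since the best individual always survives, the returned value is never worse than the best seed; seeding with the previous frame's solution therefore gives a safe starting point, although convergence to a global optimum is still not guaranteed.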

9 Implementation 

Using the camera calibration, the user sets the ellipsoid poses at time 0 so that their
projections coincide with the three real body parts to be tracked in each view.
This is done by calculating the best 3D coordinates for image positions
such as wrist, elbow, knee, etc. The shape parameters of each part are also set
manually for the whole film.

The system then tracks each virtual part in 3D. First, the virtual projections
are calculated at frame $t$; for each pixel in the image range, the system calculates the 3D
position of a point on the ellipsoid that is projected onto the pixel. Frame $t + 1$
is then loaded. An iterative procedure begins that initializes $\phi = 0$. At iteration
$n$, the function is evaluated for a certain value of $\phi$. The $(u_x, u_y)$ displacement is
estimated using the pose change $\phi$ for each pixel $(x, y)$. The difference of image
intensities between the pixel $(x + u_x, y + u_y)$ at frame $t + 1$ and the pixel $(x, y)$
at frame $t$ is accumulated for all pixels, and also the variation of the same pixel
in $t + 1$ with respect to the average in $t = 0$ for the corresponding part. For
the experiments, the value of $\phi$ that minimizes the functional within all the
iterations is used to calculate the $t + 1$ pose. The procedure starts again for the
next frame.

10 Experiments 

Experiments were designed to test the system performance. There are several
issues to check. First of all, the system should work on different films, with
different body parts. Second, we would like to know the importance of color
tracking. And third, some comparison is needed with respect to systems that do
not use a robust function.

A Pentium II processor at 450 MHz is used. The frames are 640 × 480 pixels,
and the cameras are situated between three and five meters from the subject.
The parameter $\sigma$ for the robust function is 0.5 × MaxQ, where MaxQ is the
number of quantization levels for a color channel, in our case $2^8$.

10.1 A Black and White Film 

A subject is filmed by three black-and-white cameras. Three ellipsoids are used 
to track the three parts of an arm (see Figure 3). Several levels of resolution were 
tried. The system works well using either the finest (first) level or the second. 
Since the results on the second level are good enough and we can get them 
faster, we report only on them. Also, if both levels are used, the results are not 
significantly better than using only the second level. 
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Fig. 3. Arm tracking in Black and White 
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The kinematic constraints used keep the articulations connected, and also 
the shoulder is fixed in space in the virtual model. 

All 50 frames (the whole sequence) can be tracked correctly using the second 
level. If only the third level is used, only 10 frames are tracked before the 
ellipsoids lose track of their corresponding body parts. At this level, the motion is 
not very clear in all the views because of the low resolution. Using two levels, the 
system processes approximately 0.8 frames/min. Using only the second level, the 
speed is 2 frames/min. 

In some frames the virtual elbow does not coincide with the real elbow; 
it seems that the original position given by the user for the shoulder was not 
correct, and the upper-arm length does not fit in the corresponding region. Still, 
the system is able to track the arm. 



10.2 Arm Tracking within a Color Film 

Another sequence was captured using two color cameras. The arm was selected 
again because of its visibility and mobility. The results are in Figures 4 and 5. 
The shoulder is also assumed fixed in space. 

Using only the second level, the system starts in frame 0 and tracks 270 
frames, which is the length of the whole sequence. After frame 120 (see the 
next-to-last frame shown) the subject moves the shoulder, so the virtual 
arm cannot cover the hand. However, 30 frames later the system has recovered 
the hand because it tries to keep the original colors of the parts. The throughput 
is 2.8 frames/min. 

If the first resolution level is used, no better results are found, since the results 
at the second level are already very good. It just makes the system slower. If only 
the third level is used, the system cannot track more than 20 frames, although 
the arm motion is still very clear at that level; maybe the optimization algorithm 
is not good enough, but it is still unclear why the system does not work at the 
lowest level of resolution tried. 

If only the black and white information is used (the L component), the 
ellipsoids lose track of their parts when the subject moves the shoulder and do not 
recover. This means that the color information makes the system more reliable. 



10.3 Leg Tracking 

A sequence was filmed with two color cameras and a leg was tracked with three 
ellipsoids. The tip of the foot is assumed fixed in space. 

Figure 6 shows the results. The subject wears very dark clothes so the color 
information is not very useful. Also, since there is no change of texture or color 
between the parts, the sequence is really challenging. The system correctly tracks 
40 frames. It is clear that if the virtual model included a part for the trunk, the 
setting would have less ambiguity, i.e., virtual legs could not move freely in the 
trunk regions. 
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Fig. 4. Arm tracking. 



10.4 Robustness 

We would like to compare the system performance with that of systems that 
are not robust. To do so, we changed the value of $\sigma$ in the previous experiments 
and compared the results up to a fixed frame. If $\sigma$ is 1.0 × MaxQ, $\rho$ is almost 
linear in the range of intensity variation and, therefore, a large variation counts 
more than a small one. If $\sigma$ is 0.5 × MaxQ, $\rho$ does not necessarily count large 
variations more than small ones, and outliers have less importance. 




Fig. 5. Arm tracking (cont.). 



Figure 7 shows the results for different values of $\sigma$: 0.2, 0.4, 0.6, 0.8 
and 1.0 of MaxQ. We can see that values under 1.0 give better results than 1.0. 
We obtained similar results for the two arm sequences, but they are not shown 
because of space limitations. In the color tracking of the arm, there is no significant 
performance difference at that frame (80) when the value of $\sigma$ is changed. 

As stated before, we chose a value in the middle of this range, $\sigma$ = 0.5 × 
MaxQ, for all the previous experiments. 
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Fig. 6. Leg tracking. 



In another experiment with the color tracking of an arm, the function $\rho$ was 
replaced by a simple absolute value. In this case, the tracking was poorer: the arm 
and hand were farther from their true directions during most of the sequence, 
covering background regions. In this sense, the robust function adds tracking 
stability. 
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Fig. 7. Robustness comparison of “leg” sequence. 



11 Conclusions and Future Work 

We have presented a theoretical framework for robust tracking of human parts 
based on a 3D model. Our experiments show that the robust estimation of mo- 
tion parameters works better than non-robust estimations, but more testing is 
needed. Also, the use of color makes the system able to recover easily from 
tracking errors. 
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A more reliable global optimization technique may be needed. However, the 
use of more information from all the body parts may accelerate the convergence 
in a space of around 45 dimensions. 

It is also expected that if more realistic parts are used for the virtual model, 
less motion ambiguity is allowed and we would get better results. Better 
rendering techniques (perspective projection), which usually come with a virtual 
humanoid, may also help. However, if the virtual parts fit the real parts better, less 
motion flexibility is available to compensate for imperfect rotations. It is possible 
that robustness would then be even more important. 

Better color representations (e.g. LUV or Lab) can also be used, and different 
robust parameters for each term of the functional can be tried (difference with 
frame 0, with previous frame, for each part, and color channel of each camera). 

Although the multi-scale analysis was not useful in our experiments, except 
for finding a resolution level of fast and correct tracking, it may prove to be 
essential when the full body is tracked. 

We consider our results promising and believe that this research is worth 
pursuing further. 
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Abstract. When we seek to directly learn basis functions from natural 
scenes, we are confronted with the problem of simultaneous estimation of 
these basis functions and the coefficients of each image (when projected 
onto that basis). In this work, we are mainly interested in learning matrix 
space basis functions and the projection coefficients from a set of 
natural images. We cast this problem in a joint optimization framework. 
The Frobenius norm is used to express the distance between a natural 
image and its matrix space reconstruction. An alternating algorithm is 
derived to simultaneously solve for the basis vectors and the projection 
coefficients. Since our fundamental goal is classification and indexing, we 
develop a matrix space distance measure between images in the training 
set. Results are shown on face images and natural scenes. 



1 Introduction 

In recent years, there has been a lot of progress in the mathematical representa- 
tion of natural scenes [7]. Once it became clear that coding methods using the 
discrete cosine transform (DCT) [9], Gabor expansions [4], wavelets [3], etc. were 
quite successful in generating compact codes for images, there was an increased 
interest in directly learning such bases from natural images. Directly learning 
the basis vectors presents a more challenging computational problem than the 
usual coding case, because in the former, both the bases and the coefficients have 
to be estimated from the natural images. 

The most comprehensive work in this general area of simultaneous estimation 
of both basis vectors and coefficients is the work of Olshausen and his 
collaborators [7,6,8]. In this body of work, the main interest is in learning compact 
and (usually) overcomplete representations of natural scenes. In this approach, a 
joint optimization problem is typically constructed on both the basis vectors and 
the coefficients. A regularization prior is always added to enforce prior knowledge 
of sparsity of the coefficients. Recently, this work has evolved toward adding a 
mixture of Gaussians prior [8] in order to enforce the constraints of sparsity and 
distributivity of the coefficients. 

In this paper, we are mainly interested in recasting this problem of simultane- 
ous estimation of bases and coefficients in a matrix space. We wish to point out 
that matrix space representations of natural images have been mostly ignored 
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despite the fact that it is mostly common knowledge that the singular value 
decomposition (SVD) [5] is sometimes a good choice for image representation 
and compression [10]. However, all the previous work (that we are aware of) on 
SVD-based image representation is focused on single images. There is no previ- 
ous work on estimating common matrix space bases from a set of natural images. 
To state the goal of this paper, we wish to simultaneously learn a matrix basis 
set and the projection coefficients of each image. Given a set of 2D grayscale 
natural images, we represent each of them as matrices and then formulate the 
learning problem as a joint optimization of matrix basis vectors and projection 
coefficients. 

In contrast to earlier work by Olshausen and his collaborators [7], our primary 
motivation is not to develop a compact code based on the statistics of natural 
scenes. Instead, we are more interested in common matrix space representations 
for the purposes of recognition, classification and indexing. That is, we believe 
that matrix space representations can be used to develop new matrix space 
distance measures for the purposes of classification and indexing. Some of the 
experimental results presented in the paper aim toward this goal. 

2 Grayscale Intensity Images as Matrices 

As mentioned in the Introduction, we are mainly motivated by the overall success 
of image coding. Our principal aim in this paper is to learn a common matrix 
space representation from a set of images without any a priori knowledge of the 
basis set or of the projection coefficients. The fundamental idea is to express each 
grayscale intensity image as a matrix and then investigate the extent to which 
compact coding allows us to extract a common matrix basis from the set of 
pregiven images. Since coding works and works well, we expect that each image 
is expressible by a compact set of basis vectors and coefficients. For example, it 
is well known though not so widely exploited that the SVD representation of a 
2D grayscale intensity image has a spectrum that rapidly falls to zero [10]. 

Denote by X the matrix corresponding to a grayscale intensity image. The 
SVD of the image is given by 

$$X = U S V^T \qquad (1)$$

where $U^T U = V^T V = I$. Since the image matrix need not be square, note that 
the dimensions of the U and V matrices can be different. Reconstructions of the 
original image using a set of D components can be written as 

$$\hat{X} = \sum_{i=1}^{D} S(i,i)\, u_i v_i^T \qquad (2)$$

where $u_i$ and $v_i$ are the ith eigenvectors. 

In Figures 1 and 2, we show the spectrum of the well known Lenna image and 
the SVD-based reconstructions respectively. The first 5 SVD component images 
(computed as $u_i v_i^T$) are also shown in Figure 2. It is evident from both the 
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Fig. 1. Spectrum of the Lenna image (256x256). Left: Log-log plot of the spectrum. 
Right: Spectrum plot. Note the almost linear nature of the log-log spectrum for the 
dominant singular values. 








Fig. 2. Lenna reconstructions and components: Top row: Reconstructions using 
1, 5, 10, 15 and 20 components. Bottom row: The SVD component images $u_i v_i^T$ 
corresponding to the first five components. 



spectrum plots and the component images that the spatial frequencies tend to 
vary in inverse proportion to the SVD coefficients: the higher the spatial frequencies 
in a component image, the smaller the SVD coefficient and vice versa. The 
rapidity with which the SVD coefficients fall off is evident from Figure 2. 
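
The truncated reconstruction in (2) can be tried out with a few lines of NumPy. The sketch below is only an illustration: the function names `svd_reconstruct` and `reconstruction_error` and the synthetic test pattern are assumptions made here, not part of the original experiments (which used the Lenna image).

```python
import numpy as np

def svd_reconstruct(X, D):
    """Rank-D reconstruction: the sum of the D leading terms S(i,i) * u_i v_i^T."""
    # NumPy returns the singular values in decreasing order.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Scaling the first D columns of U by s[:D] is the same as U_D @ diag(s_D).
    return (U[:, :D] * s[:D]) @ Vt[:D, :]

def reconstruction_error(X, D):
    """Frobenius norm of the residual left after keeping D components."""
    return np.linalg.norm(X - svd_reconstruct(X, D), "fro")

# Hypothetical stand-in for an image: a smooth 64 x 48 pattern plus mild noise.
rng = np.random.default_rng(0)
rows = np.linspace(0.0, 1.0, 64)[:, None]
cols = np.linspace(0.0, 1.0, 48)[None, :]
image = np.sin(4 * rows) * np.cos(3 * cols) + 0.01 * rng.standard_normal((64, 48))

# The error shrinks as more components are kept.
errors = [reconstruction_error(image, D) for D in (1, 5, 10, 20)]
```

The residual after keeping D components equals the square root of the sum of the squared discarded singular values, which is why a rapidly decaying spectrum translates into good low-rank reconstructions.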

3 Learning a Common Reference Frame 

We now embark upon the formulation of the problem. Since we wish to represent 
2D grayscale intensity images as matrices, the technical challenge is one of finding 
a common eigen reference basis set of vectors to represent all of the images in 
the chosen set. Since we have decided to use a matrix space representation for 
the non-square images, the common eigen reference frame consists of a set of row 
eigen vectors, each of which has a corresponding counterpart in the set of column 
eigen vectors. Insofar as only the eigen reference frame is deemed common, the 
projections of each image in the chosen set onto the basis will be different and 
unknown a priori. Consequently, not only do we have to learn the common eigen 
reference frame, in addition, we have to compute the projections of each image 
onto the reference in order to determine the representation error. 
Let the set $\{X_k, k \in \{1, \ldots, K\}\}$ denote the collection of images. The 
common eigen reference frame is denoted by the pair $(U, V)$, which is intended to 
invoke the SVD association. The projection of image $X_k$ onto the basis set is 
denoted by $\Lambda_k$, with the set $\{\Lambda_k, k \in \{1, \ldots, K\}\}$ denoting the set of 
projections. Each image $X_k$ is of size M × N and the sizes of the basis sets U and V 
are M × D and N × D respectively. The size of each $\Lambda_k$ is D × D. Note that 
each $\Lambda_k$, $k \in \{1, \ldots, K\}$, is diagonal since it is a projection onto the basis set. 
Inequalities such as $D \le \min(M, N)$ hold as a consequence of the orthogonality 
conditions on U and V. 

The central technical contribution of this paper is a mathematical unpacking 
of the following intuition: A reasonable criterion for learning the basis set $(U, V)$ 
is to minimize the representation error of each image $X_k$ in the set of K images 
when projected onto the space of basis sets $(U, V)$. This criterion is somewhat 
complicated by the fact that the projections of the images onto the basis set are 
themselves computed after the basis set is fixed. 

We use the Frobenius norm $\|A\|_F^2 = \sum_{i,j} |a_{ij}|^2$ to mathematically 
characterize the representation error [5]. This norm is chosen because of its close 
relationship to the matrix spectra. The central cost function used in this paper 
is 

$$(\hat{U}, \hat{V}) = \arg\min_{(U,V)} E_{\mathrm{matrixbasis}}(U, V) = \arg\min_{(U,V)} \sum_{k=1}^{K} \|X_k - U \Lambda_k V^T\|_F^2 \qquad (3)$$

with 

$$\Lambda_k = \mathrm{diag}\,(U^T X_k V), \quad k \in \{1, \ldots, K\} \qquad (4)$$

and the constraints 

$$U^T U = I_D \ \text{ and } \ V^T V = I_D. \qquad (5)$$

Equation (4) relates the projection $\Lambda_k$ with the basis set $(U, V)$. The diag 
operator emphasizes the fact that each $\Lambda_k$ is diagonal. In this paper, we have elected 
not to treat the projections as quasi-independent variables. Following [8], it is 
certainly possible to treat $\Lambda_k$ as a separate variable and associate a Bayesian 
regularization prior with it. We have not done so for reasons of simplicity at 
this preliminary stage. The constraints in (5) express the fact that U and V 
are orthonormal, which is in sync with their treatment as SVD-like bases. Note 
that we have also chosen not to enforce the constraint that each diagonal entry 
in $\Lambda_k$ is strictly positive. For $U \Lambda_k V^T$ to be linked to an SVD representation, 
this constraint has to be active. However, we merely treat the reference basis 
as “SVD-like” in this paper. This constraint can also be enforced as a Bayesian 
prior if needed. 

We now derive an alternating algorithm to minimize the energy function 
in (3). First, we enforce the orthogonality constraints in (5) using Lagrange 
parameters [1]. The Frobenius norm is also rewritten using matrix operators. 

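$$E_{\mathrm{matrixbasis}}(U, V, \mu, \nu) = \sum_{k=1}^{K} \mathrm{trace}\left[(X_k - U \Lambda_k V^T)^T (X_k - U \Lambda_k V^T)\right] + \mathrm{trace}\left[\mu\,(U^T U - I_D)\right] + \mathrm{trace}\left[\nu\,(V^T V - I_D)\right] \qquad (6)$$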







with the understanding that $\Lambda_k = \mathrm{diag}\,(U^T X_k V)$. The Lagrange parameter 
matrices $\mu$ and $\nu$ are both symmetric [1] D × D matrices. This energy function 
can be further transformed by dropping terms not involving the basis set $(U, V)$ 
and by enforcing the orthogonality constraints in the matrix representation error 
norm. 

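$$E_{\mathrm{matrixbasis}}(U, V, \mu, \nu) = -2 \sum_{k=1}^{K} \mathrm{trace}\left[X_k^T U \Lambda_k V^T\right] + \mathrm{trace}\left[\mu\,(U^T U - I_D)\right] + \mathrm{trace}\left[\nu\,(V^T V - I_D)\right] \qquad (7)$$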
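
The reduction of the data term from a squared norm to a single trace can be spelled out as follows. This is a reader's reconstruction of the intermediate step, not taken from the original text, and it assumes that each $\Lambda_k$ is held fixed while the basis is being updated, as the alternating scheme suggests.

```latex
\begin{aligned}
\|X_k - U \Lambda_k V^T\|_F^2
  &= \mathrm{trace}\left[X_k^T X_k\right]
     - 2\,\mathrm{trace}\left[X_k^T U \Lambda_k V^T\right]
     + \mathrm{trace}\left[V \Lambda_k U^T U \Lambda_k V^T\right] \\
  &= \mathrm{trace}\left[X_k^T X_k\right]
     - 2\,\mathrm{trace}\left[X_k^T U \Lambda_k V^T\right]
     + \mathrm{trace}\left[\Lambda_k^2\right]
     \qquad \text{using } U^T U = V^T V = I_D .
\end{aligned}
```

The first term does not involve $(U, V)$, and the last term depends on the basis only through $\Lambda_k$. Under the stated assumption both can be dropped, which leaves the cross term together with the two Lagrange terms.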

To derive the alternating algorithm, we first hold V fixed and solve for U in (7). 
Then we alternate by solving for V given U. Once U and V are updated, $\Lambda_k$ is 
updated $\forall k \in \{1, \ldots, K\}$ in order to be in lockstep with the basis set $(U, V)$. 
Differentiating (7) w.r.t. U and setting the result to zero, we get 



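$$\sum_{k=1}^{K} X_k V \Lambda_k = U \mu \ \Rightarrow \ U = \sum_{k=1}^{K} X_k V \Lambda_k \, \mu^{-1}. \qquad (8)$$

Enforcing the orthogonality constraint for U, (8) is transformed into 

$$\begin{aligned}
U^T U &= \mu^{-1} \left(\sum_{k=1}^{K} X_k V \Lambda_k\right)^T \left(\sum_{k=1}^{K} X_k V \Lambda_k\right) \mu^{-1} = I_D \\
\Rightarrow \ \mu &= \left[\left(\sum_{k=1}^{K} X_k V \Lambda_k\right)^T \left(\sum_{k=1}^{K} X_k V \Lambda_k\right)\right]^{1/2} \\
\Rightarrow \ U &= \left(\sum_{k=1}^{K} X_k V \Lambda_k\right) \left[\left(\sum_{k=1}^{K} X_k V \Lambda_k\right)^T \left(\sum_{k=1}^{K} X_k V \Lambda_k\right)\right]^{-1/2}. \qquad (9)
\end{aligned}$$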



The above solution for U is obtained relative to a fixed V. Completely analogous 
to (8) and (9), we can solve for V relative to a fixed U. 



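$$V = \left(\sum_{k=1}^{K} X_k^T U \Lambda_k\right) \left[\left(\sum_{k=1}^{K} X_k^T U \Lambda_k\right)^T \left(\sum_{k=1}^{K} X_k^T U \Lambda_k\right)\right]^{-1/2}. \qquad (10)$$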



Once we have obtained a candidate basis set $(U, V)$, we keep each $\Lambda_k$ in lockstep 
with the basis by setting 

$$\Lambda_k = \mathrm{diag}\,(U^T X_k V). \qquad (11)$$



The overall algorithm is summarized below. 
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Fig. 3. Original images. Left: Face 1. Right: Face 2. 



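Learning Matrix Representations 

Initialize U, V to random M × D and N × D matrices respectively. 
Begin A: Do A until $\Delta E < \Delta E_{\mathrm{thr}}$. 

    $U^{\mathrm{old}} = U$, $V^{\mathrm{old}} = V$, $\Lambda_k^{\mathrm{old}} = \Lambda_k$, $\forall k \in \{1, \ldots, K\}$. 

    $\Lambda_k = \mathrm{diag}\,(U^T X_k V)$, $\forall k \in \{1, \ldots, K\}$. 

    $U = \left(\sum_{k=1}^{K} X_k V \Lambda_k\right) \left[\left(\sum_{k=1}^{K} X_k V \Lambda_k\right)^T \left(\sum_{k=1}^{K} X_k V \Lambda_k\right)\right]^{-1/2}$ 

    $\Lambda_k = \mathrm{diag}\,(U^T X_k V)$, $\forall k \in \{1, \ldots, K\}$. 

    $V = \left(\sum_{k=1}^{K} X_k^T U \Lambda_k\right) \left[\left(\sum_{k=1}^{K} X_k^T U \Lambda_k\right)^T \left(\sum_{k=1}^{K} X_k^T U \Lambda_k\right)\right]^{-1/2}$ 

    $\Delta E = \sum_{k=1}^{K} \left( \|X_k - U^{\mathrm{old}} \Lambda_k^{\mathrm{old}} (V^{\mathrm{old}})^T\|_F^2 - \|X_k - U \Lambda_k V^T\|_F^2 \right)$ 

End A 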



A theoretical proof showing that the energy in (6) decreases in each step 
is not easily forthcoming. This is due to the obvious heuristic device we have 
employed for updating each $\Lambda_k$. In contrast, showing that the energy decreases 
due to the U and V updates is relatively straightforward due to the updates 
being constrained least-squares solutions. In all experiments, we have merely 
executed the above algorithm until an iteration cap is exceeded. We have not 
encountered any stability problems when executing the above algorithm. This is 
quite surprising as one would have expected numerical errors to create problems 
when taking the relevant matrix inverses and square roots. As mentioned in the 
next section, no effort was made to implement the matrix inverses and square 
roots in a computationally efficient manner. The matrix routines in Matlab 
5.3 were used in all computations. 
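
For readers who want to experiment, the alternating updates can be sketched in NumPy as follows. This is a minimal sketch, not the authors' Matlab code. All function names are hypothetical, the loop simply runs to an iteration cap, and one technique is swapped in: the product $A (A^T A)^{-1/2}$ appearing in the U and V updates is computed through the thin SVD $A = P S Q^T$, for which it equals $P Q^T$ whenever A has full column rank. This avoids forming an explicit inverse square root.

```python
import numpy as np

def orthonormal_factor(A):
    """Return A (A^T A)^(-1/2), assuming A has full column rank.

    With the thin SVD A = P S Q^T the product simplifies to P Q^T.
    """
    P, _, Qt = np.linalg.svd(A, full_matrices=False)
    return P @ Qt

def projections(Xs, U, V):
    """Diagonal of U^T X_k V for every image, stored as length-D vectors."""
    return [np.einsum("md,mn,nd->d", U, X, V) for X in Xs]

def energy(Xs, U, V, lams):
    """Sum over images of the squared Frobenius reconstruction error."""
    return sum(np.linalg.norm(X - (U * lam) @ V.T, "fro") ** 2
               for X, lam in zip(Xs, lams))

def learn_matrix_basis(Xs, D, max_iter=50, seed=0):
    """Alternate the U, V and projection updates for a list of M x N images."""
    rng = np.random.default_rng(seed)
    M, N = Xs[0].shape
    U = rng.standard_normal((M, D))  # random initialization, as in the text
    V = rng.standard_normal((N, D))
    for _ in range(max_iter):
        lams = projections(Xs, U, V)
        # U update: orthonormal factor of sum_k X_k V Lambda_k (an M x D matrix).
        U = orthonormal_factor(sum(X @ (V * lam) for X, lam in zip(Xs, lams)))
        lams = projections(Xs, U, V)
        # V update: orthonormal factor of sum_k X_k^T U Lambda_k (an N x D matrix).
        V = orthonormal_factor(sum(X.T @ (U * lam) for X, lam in zip(Xs, lams)))
    return U, V, projections(Xs, U, V)
```

Calling `energy(Xs, U, V, lams)` after each sweep is one way to monitor progress. As noted above, a monotone decrease is not guaranteed, so the sketch does not use the energy as a stopping test.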



4 Results 

Two face images are shown in Figure 3. The images have been registered using 
a non-rigid registration method [2]. Both images are originally 330 x 228 and 
have been down sampled to 150 x 114. 
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Fig. 4. Reconstructed faces using 2 components Left: Face 1. Middle: Face 2. 
Right: Projections. 




Fig. 5. Reconstructed faces using 12 components Left: Face 1. Middle: Face 2. 
Right: Projections. 



In Figures 4 through 8, we present the results of learning a common matrix 
space reference frame beginning with two components and moving up to 100 
components. In each figure, the reconstructed faces are presented along with 
the projection coefficients $[\mathrm{diag}\,(U^T X_k V)]$. The first reconstruction with two 
components turned out to be identical for both faces, i.e. 

$$U \Lambda_1 V^T = U \Lambda_2 V^T.$$

In the reconstructions (which are identical), the hair texture evident in face 1 
is clearly discernible. In addition, the reconstruction has the appearance of a 
fuzzy face. As the number of basis components is increased, the quality of 
the reconstruction improves, as can be seen in the progression from Figure 4 to 
Figure 8. 

Examine the projection bar charts in Figure 4, Figure 5 and Figure 6. The 
bar charts are a visualization of the projection coefficients. Note the extensive 
similarities in the projection coefficients. They start out identical in Figure 4 
and continue to be strongly related as can be seen in Figure 6. We return to this 
issue later. 
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Fig. 6. Reconstructed faces using 25 components Left: Face 1. Middle: Face 2. 
Right: Projections. 




Fig. 7. Reconstructed faces using 50 components Left: Face 1. Middle: Face 2. 
Right: Projections. 




Fig. 8. Reconstructed faces using 100 components Left: Face 1. Middle: Face 2. 
Right: Projections. 



After reconstructing faces 1 and 2 using 12 basis components, in Figure 9 
we visualize six leading common components. This was done by only visualizing 
those components for which the projections in both face 1 and face 2 were large. 
Each image shown in Figure 9 corresponds to a component $u_k v_k^T$ where k is 
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Fig. 9. Reconstruction using 12 components: First six components of the 
shared representation. Top: Components one through three. Bottom: Components four 
through six. 



the component index. The first component image is not particularly informative 
but it does correspond to the largest projection value in both face images. The 
second component image is a different story. It bears a striking resemblance to 
the reconstruction using two components as shown in Figure 4. The remaining 
component images show increased spatial frequencies but do not have easily 
discernible patterns. 

Next, we learn a common basis set for three very different images: a baboon, 
an outdoor scene with a boat, and a girl. (Each image is 256 × 256.) Prior to 
reconstruction, we normalized the spectra of the three images in the following 
manner. First, we evaluated the number of SVD components necessary to 
reconstruct each image with minimal loss. Then, we normalized the spectra of each 
image relative to the mean spectrum. The images before and after normalization 
are shown below in Figure 10. There are almost no perceptible differences. 

After normalizing the spectra, we reconstructed the images using 12, 25, 50, 
100 and 200 components. The results are shown in Figure 11. Somewhat to our 
surprise, the original images are discernible in the reconstructions using only 25 
components. After about 50 components, there is no clear visual improvement 
which is also surprising. To take stock of what has been achieved, please note 
that the 25 component reconstructions use a common matrix space basis for all 
three images. Since the original images are quite different, it is not immediately 
obvious why the reconstructions should so closely resemble the originals. 

For a more quantitative understanding of the above reconstructions, we turn 
to Table 1. From the table, we see that the reconstruction error $\|X_k - U \Lambda_k V^T\|_F^2$ 
for the girl image is worse than that of the baboon and the boat images. (Recall 
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Fig. 10. Top row: Original baboon, boat and girl images. Bottom row: The same images 
after spectrum normalization. 















Fig. 11. Reconstructed images using 12, 25, 50, and 100 components 



that the images $\{X_k\}$ have normalized spectra with the largest singular value 
set to one.) In addition to the reconstruction error, in Table 2, we compute the 
matrix space distances between the three images. The matrix space distance 
between image k and image l is defined as 

$$D_{\mathrm{matrixspace}}(X_k, X_l) = \|U \Lambda_k V^T - U \Lambda_l V^T\|_F^2 = \|\Lambda_k - \Lambda_l\|_F^2 = \sum_{i=1}^{D} (\lambda_{ki} - \lambda_{li})^2$$
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Table 1. Reconstruction errors for the baboon, boat and girl images using 50 compo- 
nents. The leading singular value for all three images is unity. 




Table 2. Matrix space distances between the three images using 50 component recon- 
structions. 
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Table 3. Reconstruction errors using 100 components. 




Image    1      2      3      4      5      6 
Error  0.058  0.153  0.217  0.078  0.079  0.048 



where $\lambda_{ki}$ is the ith projection coefficient of the kth reconstruction. Since the 
reconstructions are in a common matrix space, the image distances are natu- 
rally reduced to the distances between the projection coefficients. As before, the 
integer D denotes the number of components used for the reconstruction. 
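
Because the distance reduces to a sum of squared coefficient differences, it is cheap to compute once the projections are known. The snippet below is an illustrative sketch. The names `matrix_space_distance` and `distance_table` are hypothetical, and the projection coefficients are assumed to be stored as plain length-D vectors.

```python
import numpy as np

def matrix_space_distance(lam_k, lam_l):
    """Sum of squared differences between two sets of projection coefficients."""
    lam_k = np.asarray(lam_k, dtype=float)
    lam_l = np.asarray(lam_l, dtype=float)
    return float(np.sum((lam_k - lam_l) ** 2))

def distance_table(lams):
    """Symmetric K x K table of pairwise matrix space distances."""
    K = len(lams)
    return np.array([[matrix_space_distance(lams[k], lams[l]) for l in range(K)]
                     for k in range(K)])

# Numerical check of the reduction for a random orthonormal basis pair (U, V).
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((7, 3)))  # 7 x 3, orthonormal columns
V, _ = np.linalg.qr(rng.standard_normal((5, 3)))  # 5 x 3, orthonormal columns
lam_a = np.array([1.0, 0.5, -0.2])
lam_b = np.array([0.8, 0.1, 0.3])
full_distance = np.linalg.norm((U * lam_a) @ V.T - (U * lam_b) @ V.T, "fro") ** 2
# full_distance agrees with matrix_space_distance(lam_a, lam_b) up to rounding.
```

The check at the end relies only on U and V having orthonormal columns, which is exactly what constraint (5) provides.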

The next experiment further explores image distances when collapsed onto 
a single matrix space. We took six images — four faces, an outdoor scene and 
a fractal image — and reconstructed them using 100 components. Each image is 
150 × 114. We first preprocess the images such that each image has its dominant 
singular value set to unity, with the rest of the spectrum normalized as previously 
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Table 4. Matrix space distances between the six images using 100 component recon- 
structions. 





Image      1       2       3       4       5       6 
  1        0     0.0237  0.0430  0.0293  0.0619  0.0853 
  2      0.0237    0     0.035   0.0212  0.0676  0.0832 
  3      0.043   0.035     0     0.053   0.1146  0.1220 
  4      0.0293  0.0212  0.053     0     0.0667  0.0794 
  5      0.0619  0.0676  0.1146  0.0667    0     0.0817 
  6      0.0853  0.0832  0.122   0.0794  0.0817    0 



explained. The reconstruction errors and the matrix space distances are shown in 
Tables 3 and 4 respectively. From the reconstruction error table, it is clear that 
face images 2 and 3 have the worst reconstruction errors. Given this empirical 
fact, we turn to the matrix space distance table in Table 4. The distances from 
all the face images to the outdoor scene and the Julia set are by far the largest. 
There is not a single face-face distance which is greater than a face-non face 
distance. Consequently, the reconstruction error may not be a suitable measure 
by which to gauge the degree of membership of an image to the estimated matrix 
space. 

Finally, we describe initial experiments in automated filter design using the 
matrix space representation. The basic idea closely follows the model used in 
PCA filter design with one crucial difference. From the chosen image, we randomly 
choose N K × K blocks, where K is the order of the filter mask. Given 
the N K × K “images”, we learn a common matrix basis as before. Once U 
and V have been learned, we display $u_i v_i^T$ as a K × K filter mask. There are 
a total of K different such filters to choose from, for which we implemented the 
following rank ordering scheme. For each filter, we evaluated the total response 
over the entire image. The filters were rank ordered using the total response as 
the metric. The learned filters and the corresponding filtered images are shown 
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Fig. 13. Top: Learned 7x7 filters. Bottom: Corresponding filtered images. From left 
to right: Top 4 filters ranked according to strength of response. 



for 3 × 3, 7 × 7 and 15 × 15 masks in Figures 12, 13 and 14 respectively. We 
noticed that the first two filters always corresponded to an intensity blur and 
a first derivative operator respectively. Visual inspection reveals that the filters 
appear to be ordered according to increasing spatial frequency. 
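
The block sampling and rank ordering steps can be sketched as follows. This is an illustration under stated assumptions rather than the original implementation: the function names are hypothetical, the basis pair (U, V) is taken as given (learned from the sampled blocks by whatever routine is available), and the "total response" is assumed here to be the sum of absolute filter outputs over all valid positions, since the text does not define it precisely.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def sample_blocks(image, n_blocks, K, rng):
    """Draw n_blocks random K x K patches from a 2D image."""
    rows = rng.integers(0, image.shape[0] - K + 1, size=n_blocks)
    cols = rng.integers(0, image.shape[1] - K + 1, size=n_blocks)
    return [image[r:r + K, c:c + K] for r, c in zip(rows, cols)]

def filter_masks(U, V):
    """Stack the rank-one masks u_i v_i^T; masks[i] is one K x K filter."""
    return np.einsum("ri,ci->irc", U, V)

def rank_filters(image, masks):
    """Order filters by total absolute response over the image (assumed metric)."""
    K = masks.shape[1]
    windows = sliding_window_view(image, (K, K))  # shape (H-K+1, W-K+1, K, K)
    # Correlate every window with every mask in one contraction.
    responses = np.einsum("hwrc,irc->ihw", windows, masks)
    totals = np.abs(responses).sum(axis=(1, 2))
    order = np.argsort(-totals)  # strongest response first
    return order, totals
```

Given blocks from `sample_blocks` and a learned pair (U, V), `rank_filters(image, filter_masks(U, V))` returns the filter indices from strongest to weakest response. Other response measures, such as the sum of squared outputs, would fit the same structure.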



5 Discussion, Extensions and Conclusion 



5.1 Isn’t This Just Principal Components (PCA)? 

Not really. In most versions of PCA, the images are first converted into a vector 
followed by covariance matrix construction from the pattern vectors. In our 
approach, there is no covariance matrix formed. Instead, a common matrix space 
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Fig. 14. Top Row: First four learned 15 x 15 filters. Second Row: Corresponding filtered 
images. Third Row: Filters 5,7,9 and 10 (15 x 15). Fourth Row: Corresponding filtered 
images. Fifth Row: Filters 12,13,14 and 15 (15 x 15). Sixth Row: Corresponding filtered 
images. 



is estimated from the image intensity matrices. There is no denying the fact that 
matrix representations are at the heart of our approach as opposed to vector 
representations in PCA. Also, there is no statistical interpretation of matrix 
representations in terms of covariance matrices and dimensionality reduction as 
in PCA. 
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5.2 Matrix Space Distances 

We view our initial experiments with matrix space distances as quite promis- 
ing. Out of the six images in Table 4, two were non face images. The matrix 
space distances clearly and unambiguously group the face images together. The 
distances from each face image to the non-face images are always greater than 
the face-face image distances. If matrix space distance turns out to be a truly 
robust distance measure, it could have an impact in pattern clustering, object 
recognition and classification etc. 

There are many ways in which the work presented here can be extended. We 
stress that the current algorithm is quite preliminary and one of our first goals is 
to add a regularization prior to better stabilize $\Lambda_k$. Once such priors are added, 
a more proper Bayesian justification (in terms of likelihood and prior) can be 
given. This would place this work in the context of earlier work by [8]. And there 
does not seem to be any obvious reason why mixtures of matrix bases cannot 
be considered in order to provide compact but overcomplete representations. 
Another promising direction of extension is the matrix space distance measure. 
We have only shown preliminary results of applying the distance measure on the 
training set. It is certainly possible to estimate basis vectors from different sets 
of training images and apply the matrix space distance measures to test sets as 
well. 

In sum, we have derived a new learning algorithm for estimating matrix space 
representations of natural 2D grayscale images. Since the images use a common 
matrix basis, this allowed us to construct matrix space distances between them. 
Preliminary results indicate that matrix space image distances possess novel 
classification properties. While these initial results are promising, it remains to 
be seen if matrix space representations and distance measures are truly effective 
in image classification and recognition. 
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Abstract. Supervised segmentation of piecewise-homogeneous image 
textures using a modified conditional Gibbs model with multiple pair- 
wise pixel interactions is considered. The modification takes into account 
that inter-region interactions are usually different for the training sample 
and test images. Parameters of the model learned from a given training 
sample include a characteristic pixel neighbourhood specifying the in- 
teraction structure and Gibbs potentials giving quantitative strengths of 
the pixelwise and pairwise interactions. The segmentation is performed 
by approaching the maximum conditional likelihood of the desired region 
map provided that the training and test textures have similar conditional 
signal statistics for the chosen pixel neighbourhood. Experiments show 
that such an approach is more efficient for regular textures described by 
different characteristic long-range interactions than for stochastic textures 
with overlapping close-range neighbourhoods. 



1 Introduction 

Supervised segmentation is intended to partition a spatially inhomogeneous tex- 
ture into regions of homogeneous textures after learning descriptions of these 
latter. Generally, neither textures nor homogeneity have universal formal def- 
initions, so that each statement of the segmentation problem introduces some 
particular texture descriptions and criteria of homogeneity. 

For the last two decades, one of the most popular approaches has been to model textures 
as samples of a discrete Markov random field on an arithmetic lattice with a 
joint Gibbs probability distribution of signals (grey levels, or colours) in the 
pixels [4,5,6,10,11]. The Gibbs model relates the joint distribution that globally 
describes the images to a geometric structure and quantitative strengths of local 
pixel interactions. Typically only pixelwise and pairwise pixel interactions are 
taken into account, and the interaction structure is specified by a spatially in- 
variant subset of pixels interacting with each individual pixel. The subset forms 
the pixel neighbourhood. The interaction strengths are specified by Gibbs po- 
tential functions that depend on signals in the pixel or in the interacting pixel 
pair. 
In this case the texture homogeneity is defined in terms of spatial invariance 
of certain conditional probability distributions, and a homogeneous texture has 
its specific spatially invariant interaction structure and potentials. The super- 
vised segmentation has to estimate these model parameters from a given training 
sample, that is, from a piecewise-homogeneous image with the known map of ho- 
mogeneous regions or from a set of separate single-region homogeneous textures. 
Then the parameters are used for taking the optimal statistical decision about 
the region map of a test image that combines the same homogeneous textures. 
This approach assumes that the parameters estimated from the training sample 
are typical for all the images to be segmented. 

This paper considers the supervised segmentation using the conditional Gibbs 
model of piecewise-homogeneous textures proposed in [7,8]. Here, the model 
is modified to reflect more fully the inter-region relations that usually are quite 
different in the training and test cases. Previously, the two-pass (initial and final) 
segmentation had been introduced to implicitly take account of the differences 
between the training and test inter-region signal statistics. The modified model 
involves more natural inter-region interactions so that the segmentation can now 
be achieved in a single pass. 

The segmentation process approximates the maximum conditional likelihood 
of a desired region map, provided that the image to be segmented and the training 
sample have the same or closely similar pixelwise and characteristic pairwise 
interactions in terms of particular conditional signal statistics. We investigate 
how precise such a segmentation is for different texture types, in particular, 
for stochastic and regular textures efficiently described by the Gibbs models of 
homogeneous textures [8,9]. 

The paper is organised as follows. Section 2 describes in brief the modi- 
fied conditional Gibbs model of piecewise-homogeneous textures and shows how 
the Controllable Simulated Annealing (CSA) introduced in [7,8] can be used 
for maximising the conditional likelihood of the desired region map. Section 3 
presents and discusses results of the supervised segmentation of typical piecewise- 
homogeneous stochastic and regular textures. The concluding remarks are given 
in Section 4. 

2 Gibbs Model of Piecewise-Homogeneous Textures 

2.1 Basic Notation 

Let $\mathbf{R} = \{(m, n) : m = 1, \ldots, M;\ n = 1, \ldots, N\}$ be a finite arithmetic lattice 
with $M \cdot N$ pixels. Let $\mathbf{Q} = \{0, \ldots, Q - 1\}$ be a finite set of grey levels. Let 
$\mathbf{K} = \{1, \ldots, K\}$ be a finite set of region labels. Let $\mathbf{g} = [g_i : i \in \mathbf{R};\ g_i \in \mathbf{Q}]$ 
and $\mathbf{l} = [l_i : i \in \mathbf{R};\ l_i \in \mathbf{K}]$ denote a piecewise-homogeneous digital greyscale 
texture and its region map, respectively, so that each pixel $i = (m, n) \in \mathbf{R}$ is 
represented by its grey level $g_i$ and region label $l_i$. 

The spatially invariant geometric structure of pairwise pixel interactions over 
the lattice is specified by a pixel neighbourhood $\mathbf{A}$. The neighbourhood singles 
out a subset of pixels (neighbours) $\{(i + a) : a \in \mathbf{A};\ i + a \in \mathbf{R}\}$, each having 
a pairwise interaction with the pixel $i \in \mathbf{R}$. Each offset $a = (\xi, \eta) \in \mathbf{A}$ defines 
a family of interacting pixel pairs, or cliques of the neighbourhood graph [2]. 
A quantitative strength of pixel interactions in the clique family 
$\mathbf{C}_a = \{(i, j) : i, j \in \mathbf{R};\ i - j = a\}$ is given by a Gibbs potential function of grey levels and 
region labels. 
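
The clique family for one offset can be enumerated directly from this definition. The following is a minimal sketch, not the authors' implementation; the function names are hypothetical, the offset is assumed to act component-wise on the lattice coordinates (m, n), and the relative size of a family is assumed to be the number of cliques divided by the number of pixels.

```python
def clique_family(M, N, offset):
    """List all pixel pairs (i, j) on an M x N lattice with i - j == offset.

    Pixels are 1-based coordinate pairs (m, n), as in the text.
    """
    xi, eta = offset
    pairs = []
    for m in range(1, M + 1):
        for n in range(1, N + 1):
            jm, jn = m - xi, n - eta  # j = i - a
            # keep the pair only if the neighbour j is also on the lattice
            if 1 <= jm <= M and 1 <= jn <= N:
                pairs.append(((m, n), (jm, jn)))
    return pairs


def relative_size(M, N, offset):
    """Assumed definition: number of cliques divided by number of pixels."""
    return len(clique_family(M, N, offset)) / (M * N)
```

On a 4 x 3 lattice the offset (1, 0) yields nine cliques, because the pixels with m = 1 have no neighbour at m - 1 = 0.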

The conditional Gibbs model in [7,8] specifies the joint probability distribution 
of the region maps for a given greyscale image in terms of the characteristic 
neighbourhood $\mathbf{A}$ and the potential $\mathbf{V} = [\mathbf{V}_p;\ \mathbf{V}_a : a \in \mathbf{A}]$. Generally, the 
potential of pixelwise interactions $\mathbf{V}_p = [V_p(k|q) : k \in \mathbf{K};\ q \in \mathbf{Q}]$ depends on 
a grey value $q$ and region label $k$ in a pixel. The potential of pairwise interactions 
$\mathbf{V}_a = [V_a(k, k'|q, q') : (k, k') \in \mathbf{K}^2;\ (q, q') \in \mathbf{Q}^2]$ depends on region label and 
grey level co-occurrences in a clique $(i, j) \in \mathbf{C}_a$. 

We assume for simplicity that the potential of pairwise interactions depends 
only on the grey level difference $d = g_i - g_j$; $d \in \mathbf{D} = \{-Q + 1, \ldots, 0, \ldots, Q - 1\}$. 
The inter-region grey level differences in the cliques of the same family are quite 
arbitrary for the various region maps of a piecewise-homogeneous texture. Therefore, 
the potentials should actually depend only on the region label coincidences 
$\alpha = \delta(l_i - l_j) \in \{0, 1\}$ so that $V_a(k, k'|q, q') = V_{a,\alpha}(k|q - q')$, where $\alpha = 0$ for the 
inter-region and $\alpha = 1$ for the intra-region pixel interactions. 

Obviously, the intra-region potentials $V_{a,1}(k|d)$ depend both on $k$ and $d$. 
But the inter-region potentials $V_{a,0}(k|d)$ actually describe only the region map 
model and should be independent of region labels and grey level differences: 
$V_{a,0}(k|d) = V_{a,0}$. In the original model [7,8] both the intra- and inter-region 
potentials depend on $k$ and $d$ because the inter-region statistics are assumed to 
be similar for the training and test images. But in most cases this assumption 
does not hold. 

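For the fixed neighbourhood $\mathbf{A}$ and potentials $\mathbf{V}$, the modified conditional 
Gibbs model of region maps $\mathbf{l}$, given a greyscale texture $\mathbf{g}$, is as follows: 

$$\Pr(\mathbf{l}|\mathbf{g}, \mathbf{V}, \mathbf{A}) = \frac{1}{Z_{\mathbf{g},\mathbf{V},\mathbf{A}}} \exp\Big( E_p(\mathbf{l}|\mathbf{g}, \mathbf{V}_p) + \sum_{a \in \mathbf{A}} E_a(\mathbf{l}|\mathbf{g}, \mathbf{V}_a) \Big) \tag{1}$$

where $Z_{\mathbf{g},\mathbf{V},\mathbf{A}}$ is the normalising factor, $E_p(\mathbf{l}|\mathbf{g}, \mathbf{V}_p)$ is the total energy of the 
pixelwise interactions: 

$$E_p(\mathbf{l}|\mathbf{g}, \mathbf{V}_p) = \sum_{i \in \mathbf{R}} V_p(l_i|g_i) = |\mathbf{R}| \sum_{q \in \mathbf{Q}} F_p(q|\mathbf{g}) \sum_{k \in \mathbf{K}} V_p(k|q)\, F_p(k|q, \mathbf{l}, \mathbf{g}) \tag{2}$$

and $E_a(\mathbf{l}|\mathbf{g}, \mathbf{V}_a)$ is the total energy of the pairwise pixel interactions over the 
clique family $\mathbf{C}_a$: 

$$E_a(\mathbf{l}|\mathbf{g}, \mathbf{V}_a) = \sum_{(i,j) \in \mathbf{C}_a} V_{a,\delta(l_i - l_j)}(l_i|g_i - g_j) = |\mathbf{R}|\, \rho_a \Big( V_{a,0}\, F_{a,0}(\mathbf{l}) + \sum_{d \in \mathbf{D}} F_a(d|\mathbf{g}) \sum_{k \in \mathbf{K}} V_{a,1}(k|d)\, F_{a,1}(k|d, \mathbf{l}, \mathbf{g}) \Big) \tag{3}$$
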

Here, $F_p(q|\mathbf{g})$ and $F_p(k|q, \mathbf{l}, \mathbf{g})$ denote the relative frequency of the grey level $q$ in 
the image $\mathbf{g}$ and of the region label $k$ in the region map $\mathbf{l}$ over the grey level 
$q$ in the image $\mathbf{g}$, respectively, and $F_a(d|\mathbf{g})$, $F_{a,0}(\mathbf{l})$, and $F_{a,1}(k|d, \mathbf{l}, \mathbf{g})$ denote 
the relative frequency of the grey level difference $d$ over the clique family $\mathbf{C}_a$ in 
the image $\mathbf{g}$, of the inter-region label coincidences in the region map $\mathbf{l}$, and of 
the intra-region label coincidences for the region $k$ in the region map $\mathbf{l}$ over the 
grey level difference $d$ in the clique family $\mathbf{C}_a$ for the image $\mathbf{g}$, respectively. The 
factor $\rho_a = |\mathbf{C}_a| / |\mathbf{R}|$ gives the relative size of the clique family $\mathbf{C}_a$. 
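
The pairwise statistics entering Eq. (3) can be collected in one pass over a clique family. The sketch below is illustrative only: the function name and the dictionary layout are hypothetical, images are assumed to be 0-indexed lists of rows, and the intra-region frequency for a pair (k, d) is assumed to be normalised by the number of cliques with grey level difference d.

```python
from collections import Counter


def pairwise_statistics(g, l, offset):
    """Collect pairwise statistics for one clique family.

    g, l   -- grey-level image and region map as equally sized lists of rows
    offset -- (xi, eta); a clique is a pixel i and its neighbour j = i - offset
    Returns (F_a, F_a0, F_a1):
      F_a[d]       relative frequency of the grey level difference d
      F_a0         relative frequency of cliques whose two labels differ
      F_a1[(k, d)] share of cliques with difference d whose labels both equal k
    """
    rows, cols = len(g), len(g[0])
    xi, eta = offset
    diff_count = Counter()   # cliques per grey level difference d
    intra_count = Counter()  # cliques per (k, d) with l_i == l_j == k
    inter = 0                # cliques with l_i != l_j
    total = 0
    for m in range(rows):
        for n in range(cols):
            jm, jn = m - xi, n - eta
            if not (0 <= jm < rows and 0 <= jn < cols):
                continue
            total += 1
            d = g[m][n] - g[jm][jn]
            diff_count[d] += 1
            if l[m][n] == l[jm][jn]:
                intra_count[(l[m][n], d)] += 1
            else:
                inter += 1
    if total == 0:
        raise ValueError("offset leaves no cliques on this lattice")
    F_a = {d: c / total for d, c in diff_count.items()}
    F_a0 = inter / total
    F_a1 = {(k, d): c / diff_count[d] for (k, d), c in intra_count.items()}
    return F_a, F_a0, F_a1
```

With this normalisation, summing a potential over all cliques of the family reproduces the bracketed expression of Eq. (3) multiplied by the number of cliques.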



2.2 Learning the Model Parameters 

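As shown in [8], the potentials for a given training pair $(\mathbf{l}^{\circ}, \mathbf{g}^{\circ})$ have a simple 
first approximation of the maximum likelihood estimate. Because of the unified 
inter-region potential values, this approximation is now as follows: 

$$\begin{aligned}
&\forall k \in \mathbf{K};\ q \in \mathbf{Q};\ d \in \mathbf{D}: \\
&V_p^{[0]}(k|q) = \lambda^{[0]} F_p(q|\mathbf{g}^{\circ}) \big( F_p(k|q, \mathbf{l}^{\circ}, \mathbf{g}^{\circ}) - \mu \big) \\
&V_{a,1}^{[0]}(k|d) = \lambda^{[0]} \rho_a F_a(d|\mathbf{g}^{\circ}) \big( F_{a,1}(k|d, \mathbf{l}^{\circ}, \mathbf{g}^{\circ}) - \mu_1 \big)
\end{aligned} \tag{4}$$

where $\mu = 1/K$, $\mu_0 = 1 - 1/K$, and $\mu_1 = 1/K^2$ are the marginal probabilities of 
region labels and their coincidences, respectively, for the independent random 
field (IRF) with the equiprobable region labels. 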
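
As an illustration of the pixelwise part of Eq. (4), the sketch below computes a first-approximation potential from a training pair. It is a hedged example rather than the authors' code: the function name is hypothetical, the scaling factor is passed in as a plain number, and the marginal label probability of the IRF is assumed to be 1/K for K equiprobable labels.

```python
from collections import Counter


def initial_pixelwise_potential(g, l, labels, scale):
    """First-approximation pixelwise potential V[(k, q)] for a training pair.

    g, l   -- training image and its region map as equally sized lists of rows
    labels -- list of all region labels
    scale  -- scaling factor (lambda^[0] in the text), supplied by the caller
    """
    pixels = [(q, k) for g_row, l_row in zip(g, l) for q, k in zip(g_row, l_row)]
    n = len(pixels)
    grey_count = Counter(q for q, _ in pixels)  # pixels per grey level q
    joint_count = Counter(pixels)               # pixels per (q, k)
    mu = 1.0 / len(labels)                      # assumed IRF label probability
    V = {}
    for q, count_q in grey_count.items():
        F_q = count_q / n                       # relative frequency of q
        for k in labels:
            F_k_given_q = joint_count[(q, k)] / count_q
            V[(k, q)] = scale * F_q * (F_k_given_q - mu)
    return V
```

A label that occurs more often than chance at a given grey level receives a positive potential, and a label that occurs less often receives a negative one.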

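The initial scaling factor $\lambda^{[0]}$ in Eq. (4) is computed as 

$$\lambda^{[0]} = \frac{e_p(\mathbf{l}^{\circ}, \mathbf{g}^{\circ}) + \sum_{a \in \mathbf{A}} \big( e_{a,0}(\mathbf{l}^{\circ}) + e_{a,1}(\mathbf{l}^{\circ}, \mathbf{g}^{\circ}) \big)}{e_p(\mathbf{l}^{\circ}, \mathbf{g}^{\circ})\, \psi + \sum_{a \in \mathbf{A}} \big( e_{a,0}(\mathbf{l}^{\circ})\, \psi_0 + e_{a,1}(\mathbf{l}^{\circ}, \mathbf{g}^{\circ})\, \psi_1 \big)} \tag{5}$$

where $e_p(\mathbf{l}^{\circ}, \mathbf{g}^{\circ})$, $e_{a,0}(\mathbf{l}^{\circ})$, and $e_{a,1}(\mathbf{l}^{\circ}, \mathbf{g}^{\circ})$ are the normalised first approximations 
of the Gibbs energies of pixelwise and pairwise pixel interactions: 

$$\begin{aligned}
e_p(\mathbf{l}^{\circ}, \mathbf{g}^{\circ}) &= \sum_{q \in \mathbf{Q}} F_p^2(q|\mathbf{g}^{\circ}) \sum_{k \in \mathbf{K}} F_p(k|q, \mathbf{l}^{\circ}, \mathbf{g}^{\circ}) \big( F_p(k|q, \mathbf{l}^{\circ}, \mathbf{g}^{\circ}) - \mu \big) \\
e_{a,0}(\mathbf{l}^{\circ}) &= \rho_a^2\, F_{a,0}(\mathbf{l}^{\circ}) \big( F_{a,0}(\mathbf{l}^{\circ}) - \mu_0 \big) \\
e_{a,1}(\mathbf{l}^{\circ}, \mathbf{g}^{\circ}) &= \rho_a^2 \sum_{d \in \mathbf{D}} F_a^2(d|\mathbf{g}^{\circ}) \sum_{k \in \mathbf{K}} F_{a,1}(k|d, \mathbf{l}^{\circ}, \mathbf{g}^{\circ}) \big( F_{a,1}(k|d, \mathbf{l}^{\circ}, \mathbf{g}^{\circ}) - \mu_1 \big)
\end{aligned} \tag{6}$$

and $\psi = \mu(1 - \mu)$ and $\psi_\alpha = \mu_\alpha(1 - \mu_\alpha)$; $\alpha = 0, 1$, are the variances of the region 
label frequencies for the IRF. 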

As in [7,8], the most characteristic interaction structure is recovered in this paper 
by choosing the clique families with the top values of the total energy of pairwise 
pixel interactions: 

$$e_a(\mathbf{l}^{\circ}, \mathbf{g}^{\circ}) = e_{a,0}(\mathbf{l}^{\circ}) + e_{a,1}(\mathbf{l}^{\circ}, \mathbf{g}^{\circ})$$
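
Once the total energies are known, selecting the characteristic neighbourhood reduces to a ranking. The sketch below assumes the energies have already been computed and stored in a dictionary keyed by offset; the function name is hypothetical, and the sample values are the first eight entries of Table 2.

```python
def top_clique_families(energies, size):
    """Return the offsets of the `size` clique families with the largest energy."""
    ranked = sorted(energies.items(), key=lambda item: item[1], reverse=True)
    return [offset for offset, _ in ranked[:size]]


# Energies of the first eight clique families listed in Table 2.
energies = {
    (1, 0): 413.1, (0, 1): 330.2, (1, 1): 277.2, (-1, 1): 262.9,
    (2, 0): 240.6, (2, 1): 209.7, (0, 2): 202.1, (-2, 1): 198.5,
}
selected = top_clique_families(energies, 4)
```

Here `selected` holds the four offsets with the highest energies, in decreasing order of energy.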



2.3 Supervised Segmentation 

When the model of Eq. (1) is used for segmenting piecewise-homogeneous textures, 
we assume that the first-order and second-order statistics of the images $\mathbf{g}$ 
to be segmented and the desired region maps $\mathbf{l}$ in Eqs. (2) and (3) are similar to 
those of the training pair $(\mathbf{l}^{\circ}, \mathbf{g}^{\circ})$. This assumption is crucial because the likelihood 
maximisation algorithm we use for segmentation actually tries to minimise 
a (probabilistic) distance between the above statistics for the training pair and 
segmented image [8]. 

Because the model of Eq. (1) belongs to the exponential family of distributions, 
the log-likelihood function $L(\mathbf{V}|\mathbf{l}^{\circ}, \mathbf{g}^{\circ}) = \log \Pr(\mathbf{l}^{\circ}|\mathbf{g}^{\circ}, \mathbf{V}, \mathbf{A})$ with a fixed 
neighbourhood $\mathbf{A}$ is unimodal [1] with respect to the potential $\mathbf{V}$. The maximum 
is approached by the stochastic-approximation-based Controllable Simulated 
Annealing (CSA) [8] that starts from an arbitrary initial region map $\mathbf{l}^{[0]}$, 
e.g., from a sample of the IRF, with the initial Gibbs potentials of Eq. (4). At 
each step $t$, the potentials are modified so that the conditional pixelwise 
and pairwise statistics for the current simulated pair $(\mathbf{l}^{[t]}, \mathbf{g}^{\circ})$ approach the like 
statistics for the training sample $(\mathbf{l}^{\circ}, \mathbf{g}^{\circ})$. 

The CSA is easily adapted for approaching the maximum of an arbitrary 
log-likelihood $L(\mathbf{V}|\mathbf{l}, \mathbf{g})$, provided that the pair $(\mathbf{l}, \mathbf{g})$ has the same conditional 
statistics as the training pair. The only change with respect to the maximisation 
of $L(\mathbf{V}|\mathbf{l}^{\circ}, \mathbf{g}^{\circ})$ is that the training image $\mathbf{g}^{\circ}$ is replaced with the test image $\mathbf{g}$ for 
generating the successive maps $\mathbf{l}^{[t]}$ at each step $t$. 

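Using the signal statistics for the training sample $(\mathbf{l}^{\circ}, \mathbf{g}^{\circ})$ and for the test 
image $\mathbf{g}$ with the region map $\mathbf{l}^{[t]}$ generated by stochastic relaxation with the 
current potential $\mathbf{V}^{[t]}$, the resulting process modifies the potential values as 
follows: 

$$\begin{aligned}
&\forall k \in \mathbf{K};\ q \in \mathbf{Q};\ d \in \mathbf{D};\ a \in \mathbf{A}: \\
&V_p^{[t+1]}(k|q) = V_p^{[t]}(k|q) + \lambda_t F_p(q|\mathbf{g}) \big( F_p(k|q, \mathbf{l}^{\circ}, \mathbf{g}^{\circ}) - F_p(k|q, \mathbf{l}^{[t]}, \mathbf{g}) \big) \\
&V_{a,1}^{[t+1]}(k|d) = V_{a,1}^{[t]}(k|d) + \lambda_t \rho_a F_a(d|\mathbf{g}) \big( F_{a,1}(k|d, \mathbf{l}^{\circ}, \mathbf{g}^{\circ}) - F_{a,1}(k|d, \mathbf{l}^{[t]}, \mathbf{g}) \big)
\end{aligned} \tag{7}$$
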
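
One step of this update for the pixelwise potential can be sketched as follows. The code is a simplified illustration under stated assumptions, not the CSA of [8]: the function name and dictionary layout are hypothetical, the conditional label frequencies for the training pair and for the current simulated map are assumed to be precomputed, and the stochastic relaxation that produces the current map is omitted.

```python
def update_pixelwise_potential(V, F_q_test, F_train, F_current, step):
    """Move V[(k, q)] towards matching the training label statistics.

    V         -- current potential, keyed by (label k, grey level q)
    F_q_test  -- relative frequency of each grey level q in the test image
    F_train   -- conditional label frequencies for the training pair
    F_current -- the same frequencies for the current simulated region map
    step      -- scaling factor of the current step
    Missing dictionary entries are treated as zero frequencies.
    """
    return {
        (k, q): v + step * F_q_test.get(q, 0.0)
        * (F_train.get((k, q), 0.0) - F_current.get((k, q), 0.0))
        for (k, q), v in V.items()
    }
```

When the current statistics already equal the training ones, the potential is left unchanged, which is the fixed point the process is meant to approach.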

3 Experimental Results 

3.1 Segmenting an Arbitrary Training Sample 

The best results of the above approach should be expected for segmenting just 
the training image. For instance, the training sample in Figure 1,a-b contains 
five arbitrarily chosen but spatially homogeneous natural and artificial textures. 

In this case the segmentation results in the region maps having from 17.12% 
to 0.0% of errors when the model of Eq. (1) contains, respectively, from one to 
six clique families with the top total energies (Figure 1,c-h, and Table 1). 

In these experiments the search window for choosing the characteristic interaction 
structures has 60 clique families with the short-range offsets ($|\xi| \le 5$; $|\eta| \le 5$). 
All the segmentation maps are obtained after 300 steps of the CSA with the 
scaling factor in Eq. (7) that is changing as follows: 

$$\lambda_t = \lambda^{[0]}\, \frac{1}{1 + 0.001\, t}$$

Fig. 1. Training five-region texture (a), its ideal region map (b), and segmentation 
maps obtained with 1-6 clique families (c-h), respectively. 

Table 1. Characteristic clique families included successively into 
the model of Eq. (1), their Gibbs energies, and the relative segmentation 
errors $\varepsilon$ in Figure 1,c-h. 

$|\mathbf{A}|$                                        1       2        3       4        5        6
$a = (\xi, \eta)$                                   [1,0]   [-1,1]   [0,1]   [-2,2]   [-3,3]   [-4,4]
$e_a(\mathbf{l}^{\circ}, \mathbf{g}^{\circ})$       478.4   440.3    427.7   407.4    400.4    395.4
$\varepsilon$, %                                    17.12   0.54     0.15    0.27     0.0      0.0



The same number of steps and the same schedule for changing $\lambda_t$ are used in all 
the experiments below. 



3.2 Segmenting Collages of Stochastic and Regular Textures 

To investigate the above segmentation in more detail, we use two types of homo- 
geneous textures, namely, stochastic and regular textures that can be efficiently 
simulated by the Gibbs models with multiple pairwise pixel interactions [8,9]. 
Figures 2 and 3 present collages of stochastic textures D4, D9, D29, and D57 
and regular textures D1, D6, D34, and D101 from [3]. 

The former four textures have the characteristic short-range interactions 
whereas the latter ones have mostly the characteristic long-range interactions. 
In both cases three textures from each group (the stochastic textures D4, D9, 
and D29 and the regular textures D1, D6, and D34) possess similar statistics 
of the close-range pairwise pixel interactions (in terms of the relative frequency 
distributions of grey level differences). 

Below we use the collages in Figures 2,a and 2,e with the same region map 
in Figure 2,i as the training samples for each group of the textures. The search 

Fig. 2. Four-region training (a,e) and test collages (b-d,f-h) of stochastic textures D4, 
D9, D29, D57 (a-d) and regular textures D1, D6, D34, D101 (e-h) with their ideal 
region maps (i-l). 

Table 2. Characteristic clique families with the top Gibbs energies 
$e_a(\mathbf{l}^{\circ}, \mathbf{g}^{\circ})$ selected for the training four-region collage of 
stochastic textures in Figures 2,a and 2,i. 

$a = (\xi, \eta)$                                (1,0)   (0,1)   (1,1)    (-1,1)   (2,0)   (2,1)   (0,2)    (-2,1)
$e_a(\mathbf{l}^{\circ}, \mathbf{g}^{\circ})$    413.1   330.2   277.2    262.9    240.6   209.7   202.1    198.5

$a = (\xi, \eta)$                                (1,2)   (3,0)   (-1,2)   (3,1)    (2,2)   (0,3)   (-3,1)   (-2,2)
$e_a(\mathbf{l}^{\circ}, \mathbf{g}^{\circ})$    189.8   182.3   182.1    169.7    168.4   167.1   162.4    159.9



window for choosing the characteristic interaction structure contains 3240 clique 
families with the short- and long-range offsets ($|\xi| \le 40$; $|\eta| \le 40$). 

Piecewise-Homogeneous Stochastic Textures. In this case, relative errors 
of segmenting the training image using four, eight, or 16 characteristic clique 
families with the top-rank Gibbs energies are, respectively, 21.36%, 16.44%, and 
13.88%. Here, the ranking of the clique families by their Gibbs energies results in 
only the close-range interaction structures. The corresponding 16 clique families 
in terms of their offsets $a = (\xi, \eta)$ are shown in Table 2. 

The relative errors of segmenting the training sample are slowly decreasing 
with the neighbourhood size (for instance, 10.15% for |A| = 36). Segmentation 

Fig. 3. Four-region test collages of stochastic textures D4, D9, D29, D57 (a-d) and 
regular textures D1, D6, D34, D101 (e-h) with their ideal region maps (i-l). 



errors for the test images in Figures 2,b-d and 3,a-d depend in a similar way 
on the neighbourhood size so that most of the experiments below are conducted 
with the same fixed characteristic neighbourhood of size |A| = 16. 

Segmentation of the test images yields larger error rates (23.39-32.43%) 
caused mostly by misclassified parts of the textures D4 and D9 (Figure 4). The 
main reason is that the stochastic textures D4, D9, and D29 have very similar 
signal statistics over the chosen characteristic short-range neighbourhood. Thus 
the individual regions produced by segmentation (Figures 6 and 7) differ from the 
ideal maps although they are quantitatively, in terms of the chi-square distances 
between the training and test statistics, and even visually quite homogeneous. 
Table 3 demonstrates how the segmentation separates the individual textures. 

These experiments demonstrate the basic difficulty in segmenting stochastic 
textures by taking account of characteristic conditional pairwise statistics. The 
close-range characteristic neighbourhoods selected by ranking the total Gibbs 
energies for the clique families may not be adequate for separating these tex- 
tures because the conditional signal statistics similar to the training ones can be 
obtained for regions that differ much from the ideal ones. 



Piecewise-Homogeneous Regular Textures. In this case we can expect 
more efficient segmentation using the conditional model in Eq. (1) because simu- 
lations of these textures involve usually characteristic long-range interactions [9] . 

Fig. 4. Segmentation of the collages of stochastic textures using the neighbourhood 
of size 16; the top and the bottom region maps a-d are obtained for the collages 
in Figures 2,a-d and 3,a-d, respectively. Black regions under each map indicate the 
segmentation errors. 



Table 3. Segmentation results (in %) for the training and test 
collages of stochastic textures (the rows correspond to the ideal 
regions, and the columns show how many pixels of each ideal 
region are actually assigned to a particular texture). 

         Training collage               Test collages
       D4     D9     D29    D57     D4          D9          D29         D57
D4     75.4   15.9   8.5    2.0     47.3-73.0   20.3-44.4   5.5-21.0    0.0-5.3
D9     11.6   82.6   4.2    1.6     6.3-31.7    58.3-73.8   4.9-30.4    0.4-27.2
D29    9.2    2.6    87.4   0.8     4.4-25.6    3.4-9.2     64.5-89.6   0.0-2.1
D57    0.4    0.2    0.3    99.0    0.1-1.0     0.0-9.4     0.0-2.7     89.2-99.6
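
A table of this kind, together with the relative segmentation error, can be derived from an ideal region map and a segmentation result. The sketch below is an assumed, minimal formulation (function names are hypothetical): each row is normalised by the size of the ideal region, and the error is the percentage of pixels whose assigned label differs from the ideal one.

```python
def confusion_percentages(ideal, found, labels):
    """Percentage of each ideal region (row) assigned to each label (column)."""
    counts = {k: {j: 0 for j in labels} for k in labels}
    for ideal_row, found_row in zip(ideal, found):
        for k, j in zip(ideal_row, found_row):
            counts[k][j] += 1
    table = {}
    for k in labels:
        total = sum(counts[k].values())
        # an ideal region with no pixels gets an all-zero row
        table[k] = {j: (100.0 * counts[k][j] / total if total else 0.0)
                    for j in labels}
    return table


def error_rate(ideal, found):
    """Percentage of pixels whose label differs from the ideal region map."""
    pairs = [(k, j) for r1, r2 in zip(ideal, found) for k, j in zip(r1, r2)]
    return 100.0 * sum(k != j for k, j in pairs) / len(pairs)
```

For a 2 x 2 map in which one of the two pixels of region 1 is assigned to region 2, the row for region 1 reads 50% and 50%, and the overall error is 25%.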

Fig. 5. Homogeneous regions found by segmenting the training and test collages of 
stochastic textures (the top row of the maps a-d in Figure 4). 




Fig. 6. Homogeneous regions found by segmenting the test collages of stochastic tex- 
tures (the bottom row of the maps in Figure 4,a,b). 

Fig. 7. Homogeneous regions found by segmenting the test collages of stochastic tex- 
tures (the bottom row of the maps in Figure 4,c,d). 

Table 4. Characteristic clique families with the top Gibbs energies 
$e_a(\mathbf{l}^{\circ}, \mathbf{g}^{\circ})$ selected for the training four-region collage of 
regular textures in Figures 2,e and 2,i. 

$a = (\xi, \eta)$                                (1,0)   (0,1)    (1,1)   (-1,1)   (2,0)   (0,2)   (2,1)    (1,2)
$e_a(\mathbf{l}^{\circ}, \mathbf{g}^{\circ})$    613.7   549.2    360.8   353.5    325.6   323.2   189.8    189.3

$a = (\xi, \eta)$                                (0,2)   (-2,1)   (3,0)   (0,3)    (2,2)   (3,1)   (-2,2)   (-3,1)
$e_a(\mathbf{l}^{\circ}, \mathbf{g}^{\circ})$    183.3   182.1    173.0   159.9    66.8    64.5    60.1     59.0



Actually, the relative error of segmenting the training collage in Figure 2,e is 
2.67% with the same neighbourhood size of 16 although once again the top-rank 
total energies correspond to only the close-range interactions (Table 4). 

Figure 8 shows results of segmenting the training collage and test collages of 
regular textures in Figures 2,e-h and 3,e-h, respectively, using the characteristic 
neighbourhood of size 16. The relative errors for the test collages are 1.66- 
14.09%. The test collages in Figure 3,e-h result in less precise segmentation 
because of many small subregions in these textures that affect the collected 
conditional statistics. 

The individual homogeneous regions found by segmentation are shown in 
Figures 9-11, and Table 5 demonstrates the separation of these textures. Most 
of the errors are caused by the textures D6 and D34 with very similar uniform 
backgrounds resulting in close similarity between their conditional statistics of 
short-range grey level differences. 

The desired distinctions between the close-range conditional statistics may 
exist also for certain spatially inhomogeneous textures. For example, the collages 
of regular textures in Figure 12, a, c contain both the homogeneous regular tex- 
tures D20, D55, D77 and the weakly inhomogeneous texture D36 from [3]. The 

Fig. 8. Segmentation of the training and test collages of regular textures using the 
neighbourhood of size 16: the top and bottom region maps e-h are obtained for the 
collages in Figures 2,e-h and 3,e-h, respectively. Black regions under each map indicate 
the segmentation errors. 



error rate of segmenting the training image with the characteristic short-range 
neighbourhood of size 16 is 28.63%. When the size is extended to 36 then the 
error rates of segmenting the training and test collages are 5.15% and 16.71%, 
respectively. The resulting region maps for the neighbourhood of size 36 are 
shown in Figure 12,b,d. The main errors in this case are due to assigning small 
border parts of the textures D20 and D55 to the texture D77 having similar 
statistics of the close-range grey level differences. 



4 Concluding Remarks 

These and our other experiments show that the supervised segmentation by 
approaching the maximum conditional likelihood is efficient for textures with 

Table 5. Segmentation results (in %) for the training and test 
collages of regular textures (the rows correspond to the ideal 
regions, and the columns show how many pixels of each ideal 
region are actually assigned to a particular texture). 

Training collage
        D1     D6     D34    D101
D1      96.0   4.0    0.0    0.0
D6      1.6    97.0   1.4    0.0
D34     0.8    2.4    96.4   0.4
D101    0.1    0.0    0.0    99.9

Test collages
        D1          D6          D34         D101
D1      81.9-99.9   0.1-17.7    0.0-0.8     0.0-7.7
D6      1.0-16.5    75.7-97.6   0.0-17.9    0.0-5.0
D34     0.0-4.2     1.0-16.7    78.5-99.0   0.0-2.1
D101    0.0-0.4     0.0-0.3     0.0-1.6     97.7-99.9



Fig. 9. Homogeneous regions D1, D6, D34, and D101 found by segmenting the training 
and test collages of regular textures (the top row of the maps in Figure 8,e,f). 




different characteristic interaction structures. But the overlapping close-range 
structures and similar pairwise signal statistics of individual homogeneous tex- 
tures may result in a segmentation map that differs considerably from the ideal 
one although both the maps possess conditional signal statistics similar to the 
training ones and have visually homogeneous textured regions. 

The modified conditional Gibbs model makes it possible to accelerate segmentation 
compared with the previous two-stage scheme [7,8] by obviating the need for the initial 
stage. The latter sets the inter-region potentials to zero in order to roughly approximate 
the desired homogeneous regions even though their inter-region signal 
statistics are different in the test and training samples. Then the initial (and usually 
quite “noisy”) region map is refined at the final stage by using both the intra- 
and inter-region potentials. 

The one-stage process of Eq. (7) forms the final region map starting directly 
from a sample of the IRF. At the same time the accuracy of segmentation is slightly 






Fig. 11. Homogeneous regions found by segmenting the test collages of regular textures 
(the bottom row of the maps e-h in Figure 8). 

Fig. 12. Region maps (b) and (d) and the corresponding homogeneous regions obtained 
by segmenting the training (a) and test (c) collages of regular textures D20, D36, D55, 
and D77, respectively. 



improved in that the modified model yields smaller differences between the error 
rates for the training and test images. For instance, the two-stage segmentation 
in [8] results in the error rates of 2.6% and 29.0-35.2% for the training and 
test collages of the textures D3-D4-D5-D9 from [3], respectively, or 1.9% and 
15.3-33.0% for the like collages of the textures D23-D24-D29-D34, and so forth. 
The one-stage segmentation based on the modified Gibbs model produces more 
predictable results for the test and training samples like 13.9% vs. 23.4-32.4% 
for the stochastic textures with very similar close-range pairwise signal statistics 
or 2.7% vs. 1.7-14.1% for the regular textures, respectively. 

Our experiments show that the choice of the most characteristic pixel neighbourhoods 
should be based not only on partial Gibbs energies of pairwise pixel 
interactions but also on the accuracy of segmentation. If the top-rank Gibbs 
energies correspond mostly to the close-range neighbourhoods then these lat- 
ter can be efficient only for segmenting the textures with sufficiently different 
close-range pairwise signal statistics of the homogeneous regions. 

Abstract. Moire phenomena occur when two or more images are non- 
linearly combined to create a new “superposition image” . Moire patterns 
are patterns that don’t exist in any of the original images but appear in 
the superposition image for example as the result of a multiplicative su- 
perposition rule. The topic of moire pattern synthesis deals with creating 
images that, when superimposed, will reveal certain desired moire pat- 
terns. Conditions ensuring that a desired moire pattern will be present in 
the superposition of two images are known; however, they do not specify 
these images uniquely. The freedom in choosing the superimposed images 
can be exploited to produce various degrees of visibility and ensure de- 
sired properties. Performance criteria for the images that measure when 
one superposition is better than another are introduced. These criteria 
are based on the visibility of the moire patterns to the human visual 
system and on the digitization which takes place when presenting the 
images on discrete displays. We here propose to resolve the freedom in 
moire synthesis by choosing the images that optimize the chosen criteria. 



1 Introduction 

The term moire comes from French where it refers to watered silk. The moire 
silk consists of two layers of fabric pressed together. As the silk bends and folds, 
the two layers shift with respect to each other, causing the appearance of interfering 
patterns. The moire technique for manufacturing clothes was developed 
in China a long time ago, and later introduced to France in 1754 by the English 
manufacturer Badger. Natural moire phenomena can be seen in daily life, for example 
in the folds of a moving nylon curtain or when looking through parallel 
wire-mesh fences. The first scientific observations were made by Lord Rayleigh 
[1], who suggested using the moire phenomenon for testing the quality of gratings. 

Two goals exist in moire patterns research. The first is the analysis of moire 
patterns. This usually involves some physical situation in which moire patterns 
appear either naturally or by human intervention. The task is to analyze and 
characterize the patterns. Most of the research in moire patterns analysis deals 
with finding equations describing the moire patterns. In moire pattern synthesis 
the generation of certain moire patterns is required. The synthesis process in- 
volves producing two images such that when these images are superimposed the 
required moire patterns emerge. Moire synthesis and analysis are tightly linked 
and understanding one task gives insight into the other. 



M.A.T. Figueiredo, J. Zerubia, A.K. Jain (Eds.): EMMCVPR 2001, LNCS 2134, pp. 185-200, 2001. 
(c) Springer-Verlag Berlin Heidelberg 2001 

Over the years different methods to model and analyze the moire phenomenon 
have been suggested. We shall describe two main approaches to model the moire 
phenomenon in the next section. Section 3 presents the moire pattern synthe- 
sis problem and in section 4, criteria for measuring the performance of moire 
patterns in a superposition are introduced. Section 5 discusses the integrability 
constraint which ensures that a certain vector field is a gradient field. Section 6 
reviews some basic results from variational calculus and their use in moire syn- 
thesis and section 7 addresses the problem of recovering the potential function 
of a gradient field. Section 8 concludes with some results and closing remarks. 

Throughout this paper we will discuss the case of superposition of two images 
for the sake of simplicity. It is not too difficult to extend the results to several 
superimposed images. We will assume a multiplicative superposition rule. Such a 
rule is motivated by the multiplication effect implicit in laying transparencies on 
one another. It is, however, possible to consider other superposition rules [2]. The 
nonlinearity of the multiplicative superposition allows new frequencies, which 
do not exist in the original images, to appear in the superimposed image. In fact, 
nonlinearity is at the heart of the moire phenomenon, and a linear superposition 
such as addition does not elicit it (see Fig. 1). 
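
The role of nonlinearity can be checked numerically on a one-dimensional example. The sketch below is an illustration under assumed values, not part of the original experiments: it uses two raised cosine gratings with hypothetical frequencies of 10 and 8 cycles over 64 samples and compares the amplitude of the difference frequency in their product and in their sum.

```python
import cmath
import math


def dft_amplitude(signal, k):
    """Magnitude of the k-th DFT coefficient, divided by the signal length."""
    n = len(signal)
    total = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(signal))
    return abs(total) / n


def raised_cosine(freq, n):
    """Raised cosine grating with `freq` cycles over n samples, values in [0, 1]."""
    return [0.5 + 0.5 * math.cos(2 * math.pi * freq * t / n) for t in range(n)]


N, F1, F2 = 64, 10, 8             # hypothetical sample count and frequencies
a = raised_cosine(F1, N)
b = raised_cosine(F2, N)
product = [x * y for x, y in zip(a, b)]  # multiplicative superposition
added = [x + y for x, y in zip(a, b)]    # additive superposition

beat = F1 - F2                    # difference frequency, absent from a and b
amp_product = dft_amplitude(product, beat)
amp_added = dft_amplitude(added, beat)
```

The product contains the term 0.25 cos(2π F1 t/N) cos(2π F2 t/N), which splits into components at F1 - F2 and F1 + F2, so `amp_product` is clearly non-zero, whereas `amp_added` vanishes up to rounding error.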

2 Moire Pattern Analysis 

Two models for moire pattern analysis are reviewed. The indicial equation 
method operates in the image plane and the Fourier domain method operates in 
the frequency plane. The following model descriptions closely follow [3], in which 
more detailed descriptions appear. 



2.1 The Indicial Equation Method 

The simplest and oldest model for analyzing the geometric shape of moire pat- 
terns in the superposition of two curvilinear gratings is the indicial (or paramet- 
ric) equation method surveyed in [4] and [5]. This model is based on the curve 
equations of the original curvilinear gratings. If each of the original layers is 
regarded as an indexed family of curves, the moire pattern of the superposition 
forms a new indexed family of curves, whose equations can be inferred from the 
equations of the original gratings. 

According to this model the original images consist of black curves on a 
white background. The curves in each image are assumed to be the equal height 
contours of two dimensional functions. We thus have two images which consist 
of black curved gratings whose centerlines are the equal height contours of two 
functions $\psi(x, y)$ and $\phi(x, y)$ as follows: 

$$\psi(x, y) = m, \quad m \in \mathbb{Z}; \qquad \phi(x, y) = n, \quad n \in \mathbb{Z} \tag{1}$$

Since the images are binary the multiplicative superposition is also an AND 
operator. Each of the curves in both images has an index given by the height 

Fig. 1. The two gratings in a and b are multiplied in c and added in d. Image 
brightness is scaled. 

Fig. 2. Subtractive and additive moire in the indicial equation model. 



of the respective contour. As a result adjacent curves get adjacent integers as 
indices. We denote the indices of the curves in one image by an integer variable 
$m$ and the indices of the curves in the second image by the integer variable $n$. The 
coordinates $m, n$ define an $(m, n)$ net. At each point of the $(m, n)$ net, an $m$ curve and 
an $n$ curve intersect. The $(k_1, k_2)$ moire curves are defined as the curves joining 
the intersection points of $m$ and $n$ curves whose indices obey $k_1 m + k_2 n = l$, 
where $k_1, k_2$ are constant integers and $l$ runs over the set of integers. 

Conceptually, by letting $m$ and $n$ vary continuously, the $(k_1, k_2)$ moire curves 
obeying 

$$k_1 m + k_2 n = l, \qquad l \in \mathbb{Z} \tag{2}$$

become continuous curves that may be regarded as equal-height contours of a 
new bivariate function $g(x, y)$. 

The order of the $(k_1, k_2)$ moire is defined to be the highest absolute value 
of $k_1, k_2$. The first order moires are therefore the additive moire $m + n = l$ and 
the subtractive moire $m - n = l$. The first order moire patterns are the curves 
connecting the intersection points of constant sum and constant difference of the 
curves' index values. In Figure 2, (a) and (b) show two binary images whose curves 
correspond to equal height contours of a round summit. In (c) the superposition 
of (a) and (b) is exhibited and in (d) the superposition image is zoomed in. 
The subtractive moire in (d) is described by the short diagonals of the curved 
parallelograms and the additive moire is described by the long diagonals. Clearly, 
not all the $(k_1, k_2)$ moires visually stand out. The visibility of the moire patterns 
will be discussed at greater length later. Usually only first order moires stand out, 
if at all. In the case of first order moires, sometimes only the additive or only the 
subtractive moire stands out and sometimes no moire is apparent. Substituting 
(1) in (2) eliminates the indices $m, n$: 

kiip{x, y) + k24>{x, y) = I I £ Z. 



( 3 ) 
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Fig. 3. The two raised cosine gratings 
and their spectra 




Fig. 4. Periodic profiles of a linear 
function 



We can thus state that the centerlines of the $(k_1, k_2)$ moire correspond to the
equal height contours of $g(x, y) = k_1 \psi(x, y) + k_2 \phi(x, y)$. The indicial equation
method enables a complete geometric specification of the moire curves based
on the implicit geometric specification of the curves in the two original images
as equal-height contours of bivariate functions. The indicial equation method
has several drawbacks. In order to use the indicial equation we need explicit
analytic expressions for the curves, which may not be readily available. Although
the method gives us a condition for the possible moire curves, it does not
tell us whether the moire pattern will indeed be visible to the human eye. For
example, consider the superposition of a family of horizontal lines and a family
of vertical lines. The superposition will consist of small squares. There will not
be any visible pattern although the first order moire patterns are diagonal lines
connecting the opposite vertices of each square.
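As an illustration of the indicial equation model, the following is a minimal Python sketch, not the authors' implementation. It draws two binary gratings along the integer contours of two height functions, superposes them multiplicatively, and evaluates the moire function $g = k_1 \psi + k_2 \phi$. The helper names `binary_grating` and `moire_function`, the line width, and the sampling grid are all assumptions made for the example. The gratings chosen are the horizontal and vertical line families discussed above.

```python
import numpy as np


def binary_grating(height, line_width=0.15):
    """Black (0) curves on a white (1) background.

    A pixel is black when `height` is within line_width/2 of an integer,
    i.e. near an equal-height contour height(x, y) = m. The line width is
    an illustrative assumption.
    """
    dist = np.abs(height - np.round(height))
    return np.where(dist < line_width / 2, 0.0, 1.0)


def moire_function(psi, phi, k1, k2):
    """Height function whose integer contours are the (k1, k2) moire curves."""
    return k1 * psi + k2 * phi


# Sample an 8 x 8 region; y varies along rows, x along columns.
y, x = np.mgrid[0:8:256j, 0:8:256j]
psi = x          # family of vertical lines x = m
phi = y          # family of horizontal lines y = n

grating_a = binary_grating(psi)
grating_b = binary_grating(phi)
superposition = grating_a * grating_b       # multiplicative superposition

g_subtractive = moire_function(psi, phi, 1, -1)   # contours x - y = l
g_additive = moire_function(psi, phi, 1, 1)       # contours x + y = l
```

For binary images the product coincides with a logical AND of the white regions, and the contours of `g_subtractive` and `g_additive` are the two diagonal families, which the model predicts but which do not stand out visually.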



2.2 The Fourier Domain Method 

So far we assumed that the original images were binary images showing the equal
height contours of $\psi(x, y)$ and $\phi(x, y)$ as black curves on a white background.
We now generalize this setting through the concept of a periodic profile.

The original images are allowed to be $p(\psi(x, y))$ and $p(\phi(x, y))$ where $p(z)$
is a periodic function of one variable with period one. If $p(z)$ is taken to
be a discrete impulse train, $p(z) = 1$ when $z \in \mathbb{Z}$ and $p(z) = 0$ otherwise, the images
$p(\psi(x, y))$ and $p(\phi(x, y))$ reduce to binary images with curves representing equal-
height contours of $\psi$ and $\phi$.

In this section we will restrict ourselves to linear $\psi$ and $\phi$: $\psi(x, y) = p_1 x + q_1 y$,
$\phi(x, y) = p_2 x + q_2 y$, $p_1, p_2, q_1, q_2 \in \mathbb{R}$. Note that $p(\psi(x, y))$ and $p(\phi(x, y))$ will
be periodic images. In Figure 4 (a), a linear function $\psi(x, y)$ is shown. In (b), (c)
and (d), $p(\psi(x, y))$ is shown, where the periodic profile $p$ is a discrete impulse
train, a cosine and a square wave grating respectively.

Although we here restrict ourselves to linear functions, results obtained in
this case are useful since many functions can locally be approximated by a linear
function (the first two terms of the 2D Taylor expansion).
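The three periodic profiles mentioned above can be sketched in Python as follows. This is an illustrative sketch only: the function names, the numerical tolerance of the impulse train, and the 50% duty cycle of the square wave are assumptions, since the text does not fix them.

```python
import numpy as np


def impulse_train(z, tol=1e-9):
    """p(z) = 1 when z is an integer (within tol), 0 otherwise."""
    z = np.asarray(z, dtype=float)
    return (np.abs(z - np.round(z)) < tol).astype(float)


def raised_cosine(z):
    """Raised cosine profile with period one and values in [0, 1]."""
    z = np.asarray(z, dtype=float)
    return 0.5 * np.cos(2 * np.pi * z) + 0.5


def square_wave(z):
    """Square wave with period one; the 50% duty cycle is an assumption."""
    z = np.asarray(z, dtype=float)
    return ((z - np.floor(z)) < 0.5).astype(float)


def linear_height(x, y, p1, q1):
    """Linear height function psi(x, y) = p1 * x + q1 * y."""
    return p1 * x + q1 * y


# A linear grating image: apply a profile to a linear height function.
y, x = np.mgrid[0:1:64j, 0:1:64j]
grating = raised_cosine(linear_height(x, y, p1=4.0, q1=2.0))
```

Any of the three profiles can be substituted for `raised_cosine` in the last line to obtain the corresponding grating.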

The Fourier domain method analyzes the moire pattern in the frequency$^1$
domain. According to the convolution theorem, the multiplicative superposition
rule in the image domain transforms to a two dimensional convolution between
the spectra of the original images

$g(x, y) = p(\psi(x, y)) \cdot p(\phi(x, y)) \iff G(u, v) = \mathcal{F}[p(\psi(x, y))] * \mathcal{F}[p(\phi(x, y))].$ (4)

The Fourier transform of $p(x) = e^{i 2\pi f x}$ is $P(u) = \delta(u - f)$. In the two
dimensional domain the Fourier transform of $p(x, y) = e^{i 2\pi (f_1 x + f_2 y)}$ is
$P(u, v) = \delta(u - f_1, v - f_2)$. Since $p$ is a periodic function, it can be expanded to a Fourier
series $p(x, y) = \sum_{m=-\infty}^{\infty} \sum_{n=-\infty}^{\infty} c_{m,n} e^{i 2\pi (m u_0 x + n v_0 y)}$. The Fourier transform of
$p(x, y)$ is readily obtained from the Fourier series decomposition and linearity
of the Fourier transform: $P(u, v) = \sum_{m=-\infty}^{\infty} \sum_{n=-\infty}^{\infty} c_{m,n} \delta(u - m u_0, v - n v_0)$.
Furthermore, by the convolution theorem, $G(u, v)$ consists of translated and
scaled impulses as well.

Since $\psi(x, y) = p_1 x + q_1 y$ and $\phi(x, y) = p_2 x + q_2 y$, $p(\psi(x, y))$ and $p(\phi(x, y))$ have
frequency components only in the direction of the gradients $\nabla\psi = (p_1, q_1)$ and
$\nabla\phi = (p_2, q_2)$. This means that the frequency domain representations of $p(\psi(x, y))$
and $p(\phi(x, y))$ will have impulses only along the lines through the origin in these
directions, $q_1 u - p_1 v = 0$ and $q_2 u - p_2 v = 0$. This can also
be seen from the fact that $p(\psi)$ and $p(\phi)$ are rotated 1D periodic functions on
the x axis. Therefore, $\mathcal{F}[p(\psi(x, y))]$ and $\mathcal{F}[p(\phi(x, y))]$ are obtained by rotating
the 1D spectra from the u axis by the same angles. In the case of a raised cosine
profile, $p(\cdot)$ is given by $p(\psi(x, y)) = \frac{1}{2} \cos(2\pi \psi(x, y)) + \frac{1}{2}$, and only three impulses
will exist in the frequency domain. Two impulses on either side of the origin
are contributed by the cosine function. These two impulses are the first harmonics,
at distance $\|\nabla\psi\|$ from the origin, and lie along the line $q_1 u - p_1 v = 0$ where
$\psi = p_1 x + q_1 y$. The third
impulse is contributed by the constant term and lies at the origin (see Figure 3).

The convolution of the impulse spectra is performed as a discrete convolution.
The location of each impulse in the superposition spectrum will be the vectorial
sum of the locations of two impulses, one from each original image. We will label
the $(k_1, k_2)$ superposition impulse as the impulse whose location is created by
the vectorial sum of the $k_1$ impulse in the first original spectrum and the $k_2$
impulse in the second original spectrum. The amplitude of the $(k_1, k_2)$ impulse
is the product of the amplitudes of the $k_1$ impulse in the first original spectrum
and the $k_2$ impulse in the second original spectrum.
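The labeling rule above can be sketched in a few lines of Python. The sketch assumes raised cosine gratings, whose spectra contain only the impulses $-1$, $0$ and $1$ with amplitudes $1/4$, $1/2$ and $1/4$; the dictionary layout and the function names are assumptions made for illustration.

```python
import numpy as np


def raised_cosine_impulses(grad):
    """Impulses of the grating 0.5*cos(2*pi*(p*x + q*y)) + 0.5.

    Returns {label: (location, amplitude)} with labels -1, 0, 1.
    The first harmonics sit at +/- the gradient (p, q).
    """
    g = np.asarray(grad, dtype=float)
    return {-1: (-g, 0.25), 0: (np.zeros(2), 0.5), 1: (g, 0.25)}


def superposition_impulses(first, second):
    """Discrete convolution of two impulse sets.

    The (k1, k2) impulse is located at the vector sum of the k1 and k2
    impulse locations and its amplitude is the product of their amplitudes.
    """
    result = {}
    for k1, (loc1, amp1) in first.items():
        for k2, (loc2, amp2) in second.items():
            result[(k1, k2)] = (loc1 + loc2, amp1 * amp2)
    return result


# Two nearly parallel gratings (frequencies in arbitrary units).
spectrum = superposition_impulses(
    raised_cosine_impulses((0.30, 0.00)),
    raised_cosine_impulses((0.25, 0.05)),
)
location_sub, amplitude_sub = spectrum[(1, -1)]   # subtractive moire impulse
```

With these example gradients the $(1, -1)$ impulse lands at the difference of the two grating frequencies, much closer to the origin than either original impulse.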

Each impulse in the 2D spectrum is characterized by three main properties:
its label, its geometric location and its amplitude. To the geometric location of
an impulse, a frequency vector $\mathbf{f}$ is attached (in the frequency domain). This
vector can be expressed in polar coordinates $(f, \theta)$ where $f$ is the distance of
$\mathbf{f}$ from the origin and $\theta$ is the angle of $\mathbf{f}$. In terms of the image domain, the
geometric location of an impulse in the spectrum determines the frequency $f$
and the direction $\theta$ of the corresponding periodic component in the image. The
amplitude of the impulse represents the intensity of that periodic component in
the image.

$^1$ In this paper we adopt the following Fourier transform convention (see [6]):
$F(u, v) = \int\!\!\int f(x, y) e^{-i 2\pi (ux + vy)} dx\,dy$, $f(x, y) = \int\!\!\int F(u, v) e^{i 2\pi (ux + vy)} du\,dv$.

The geometric locations which include impulses in the superposition spectrum
but do not include impulses in the original spectra represent the moire
patterns. These impulses represent new frequency components created by the
superposition and not by one of the original images alone. Impulses which are
labeled $f_{(0,i)}$ or $f_{(i,0)}$ for some $i$ exist in one of the original images since $f_0$
represents the DC term.

In the case of other profiles such as square wave, we will have also higher 
order moire impulses. The spectrum of a linear square wave grating is an infinite 
set of impulses along the frequency direction. Convolution of two such impulse 
trains will result in an infinite lattice covering the entire frequency plane. 

However, we will show that the amplitude of impulses away from the origin
tends to zero. We can disregard impulses with low amplitude since their effect on
the image is small.

More formally, let $p(z)$ be a periodic function whose values are bounded
between 0 and 1. Then all its Fourier series coefficients (impulse amplitudes)
have absolute values between 0 and 1: $p(x) = \sum_{m=-\infty}^{\infty} c_m e^{i 2\pi m x}$, $0 \le |c_m| \le 1$.

Furthermore, it is true for any convergent Fourier series that $|c_m| \to 0$ as
$|m|$ tends to $\infty$. Moreover, if $p(x)$ has $n$ continuous derivatives, then its Fourier
transform $P(u)$ tends to 0 as $|u| \to \infty$ at least as fast as $1/u^{n+1}$ (see [7] page 74):
$|P(u)| = O\left(\frac{1}{|u|^{n+1}}\right)$ as $|u| \to \infty$. Thus, the smoother the function, the more rapidly
the coefficients of the series tend to zero.
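The bound and the decay of the coefficients can be checked numerically. The sketch below estimates the Fourier series coefficients of two profiles by sampling one period and applying a discrete Fourier transform; the sample count, the function name, and the 50% duty cycle of the square wave are assumptions chosen for illustration.

```python
import numpy as np


def fourier_coefficients(profile, n_samples=1024):
    """Approximate c_m of a period-one profile from samples of one period.

    Entry m of the result approximates c_m; negative harmonics are stored
    at the end of the array, following the usual FFT ordering.
    """
    z = np.arange(n_samples) / n_samples
    return np.fft.fft(profile(z)) / n_samples


def square(z):
    return (z < 0.5).astype(float)        # discontinuous profile


def raised_cosine(z):
    return 0.5 * np.cos(2 * np.pi * z) + 0.5   # infinitely smooth profile


c_square = fourier_coefficients(square)
c_cosine = fourier_coefficients(raised_cosine)

# First few magnitudes |c_0|, |c_1|, |c_2|, |c_3| of each profile.
print(np.abs(c_square[:4]))
print(np.abs(c_cosine[:4]))
```

The square wave keeps odd harmonics that decay roughly like $1/m$, whereas the raised cosine has no harmonics beyond the first, which is why only first order moires arise for that profile.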

Recall that the spectra of the original images are given by $P(u)$ rotated along
the frequency directions. Also, the amplitude of the $(k_1, k_2)$ impulse is the product
of the amplitudes of the $k_1$ and $k_2$ impulses in the original images' spectra.
Therefore, the amplitude of the $(k_1, k_2)$ impulse in the superposition will tend
to 0 at least as fast as $\left(\frac{1}{u_1 u_2}\right)^{n+1}$ where $u_1$ and $u_2$ are the locations of the $k_1$ and $k_2$
impulses.

The Fourier approach has the advantage that it enables analysis of the moire
patterns in the frequency domain, which is a suitable domain for deciding the
response of the visual system. In the case of nonlinear gratings, the spectrum
of the gratings no longer consists of impulses and may be continuous. It is then 
impossible to analyze the superposition spectrum with the same ease as before. 
However, a local frequency analysis can use the above results since any smooth 
function can be approximated by a linear function in a small enough region. 

3 The Problem of Moire Synthesis 

Synthesis of moire patterns is the generation of two images that, when superimposed,
will reveal the intended moire pattern. We restate the condition for the
generation of a certain moire pattern. Let $\psi(x, y), \phi(x, y)$ be two 2-D functions
and let $p(z)$ be a 1-D periodic function with period of unity. The superposition
$p(\psi(x, y)) \cdot p(\phi(x, y))$ will contain the $(k_1, k_2)$ moire pattern whose geometric
layout is the equal height contours of $g(x, y)$ if the following condition holds

$g(x, y) = k_1 \psi(x, y) + k_2 \phi(x, y).$ (5)

This condition determines $\phi$ completely from $g$ and $\psi$. There is, however, a
degree of freedom in choosing $\psi$ or $\phi$. The moire pattern described by $g$ will
stand out visually for some choices of $\psi$ and yet it may be hidden for other
choices due to the inherent filtering process of the human visual system.

We will first determine criteria for evaluating the superposition image. Based
on these criteria we will choose $\psi$ and $\phi$ that will satisfy (5) and optimize our
performance criteria.

From here on, we will deal with synthesis of first order (1,-1) moire patterns 
since higher order moire patterns are generally less dominant (see section 2.2). 
Synthesis of the (1,1) or higher order moire patterns requires some straightfor- 
ward modifications. 

4 Performance Criteria 

When synthesizing moire patterns we have to choose $\psi$ and $\phi$ from an infinite set
of functions which satisfy (5). In this section we propose criteria to estimate the
visual performance of the moire. Based on these criteria we can decide whether
one choice of $\psi$ and $\phi$ that satisfy (5) is better than another. In section 6 we use
these criteria to optimally choose $\psi$ and $\phi$.

We will first consider performance criteria for linear moire. We assume that 
the superposition image and the resulting moire patterns are approximately 
linear in a small enough region. Based on local analysis and the results for linear 
moire performance we will formulate criteria for general moire patterns. 



4.1 Linear Moire Performance 

The filtering process of the human visual system is nicely demonstrated in two 
simple experiments. A figure of cosine vertical bars with continuously varying 
frequencies and amplitudes (see [8] Figure 2.4-3 p. 35) demonstrates sensitivity 
to the spatial frequency of the cosine bars. Comparing a figure of a checkerboard 
with a rotated duplicate (see [9] Fig. 5.2 p. 83) demonstrates the non-isotropy of 
the visual system filtering: The human visual system is more sensitive at 0 and 
90 degrees than at 45 degrees to changes with equal contrast and frequency. 

The contrast of a pattern $I$ is defined by $C = \frac{I_{max} - I_{min}}{I_{max} + I_{min}}$ where $I_{max}$ and
$I_{min}$ are the maximum and minimum intensities in the pattern respectively.
For an absolutely uniform image, $I_{max} = I_{min}$ and $C = 0$. For a square wave
grating, $I_{max} = 1$, $I_{min} = 0$, $C = 1$. Since the denominator is proportional to
the mean intensity, contrast can also be considered as the degree of modulation
of intensity above the mean. For sinusoidal gratings the contrast is proportional
to the amplitude.
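A small Python sketch of the contrast measure follows. It assumes the ratio of the intensity difference to the intensity sum described above; the function name and the convention of returning zero for an all-black image are assumptions.

```python
import numpy as np


def contrast(image):
    """Contrast C = (I_max - I_min) / (I_max + I_min) of a pattern.

    Returns 0.0 for an all-black image, where the ratio is undefined
    (this convention is an assumption of the sketch).
    """
    i_max = float(np.max(image))
    i_min = float(np.min(image))
    if i_max + i_min == 0.0:
        return 0.0
    return (i_max - i_min) / (i_max + i_min)


z = np.arange(8) / 8
uniform = np.full(8, 0.5)
square_grating = (z < 0.5).astype(float)
sinusoid = 0.5 + 0.2 * np.cos(2 * np.pi * z)   # mean 0.5, amplitude 0.2

c_uniform = contrast(uniform)
c_square = contrast(square_grating)
c_sinusoid = contrast(sinusoid)   # amplitude divided by mean
```

For the sinusoid the result equals the amplitude divided by the mean, which illustrates the proportionality to amplitude stated above.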

The contrast sensitivity function (CSF) of a system is defined as
$CSF = \frac{\text{output contrast}}{\text{input contrast}}$. It is not always feasible to measure the output for humans in a
completely controlled fashion. Necessarily, psychological experiments are used,
requiring many assumptions about the system behavior.

The experimental procedure for measuring the CSF involves presenting each 
of a set of vertical sinusoidal gratings on a visual display to a viewer who can 
vary the contrast control while maintaining constant average luminance. For a 
given pattern coarseness, the viewer is asked to adjust the contrast until the 
grating is just barely distinguishable. The threshold of contrast perception $c(u)$
is obtained at different spatial frequencies and the contrast sensitivity function
is $CSF(u) = \frac{a}{c(u)}$ (the constant $a$ is assigned to barely distinguishable contrast).

Dooley (see [10] p. 118) has provided the following equation to fit the data
from the above experiments: $CSF(u) = 5.05\, e^{-0.138 u} \left(1 - e^{-0.1 u}\right)$, where $u = 2\pi f$,
$f$ is the spatial frequency along the x axis in cycles per degree and $a = 0.005$.

A review of the non-isotropy of the visual system [11] describes results of
the following experiment. Gratings were presented to a viewer at different angles 
and different distances. The distances at which the gratings were barely visible 
represent the sensitivity of the visual system to orientations. 

As a measure of the visibility of an impulse whose location on the frequency
plane is $f$ we take

$V(f) = H_1(\|f\|) \cdot H_2(\mathrm{angle}(f))$ (6)

where $H_1$ and $H_2$ are the functions obtained from the above two experiments.
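To make the separable form of the visibility measure concrete, here is a hedged Python sketch. Both factors are assumptions: the radial factor uses a Dooley-style fit $5.05\, e^{-0.138 f}(1 - e^{-0.1 f})$ with the frequency argument taken directly in cycles per degree, and the angular factor is a purely hypothetical smooth function that equals one at 0 and 90 degrees and dips at 45 degrees. The text does not give a closed form for the angular factor.

```python
import numpy as np


def h1_radial(freq):
    """Assumed radial sensitivity (Dooley-style fit, frequency in cycles/degree)."""
    f = np.asarray(freq, dtype=float)
    return 5.05 * np.exp(-0.138 * f) * (1.0 - np.exp(-0.1 * f))


def h2_angular(theta, dip=0.3):
    """Hypothetical angular sensitivity: 1 at 0/90 degrees, 1 - dip at 45 degrees."""
    return 1.0 - dip * np.sin(2.0 * np.asarray(theta, dtype=float)) ** 2


def visibility(f_vec):
    """V(f) = H1(||f||) * H2(angle(f)) for frequency vectors f = (u, v)."""
    f_vec = np.asarray(f_vec, dtype=float)
    magnitude = np.hypot(f_vec[..., 0], f_vec[..., 1])
    angle = np.arctan2(f_vec[..., 1], f_vec[..., 0])
    return h1_radial(magnitude) * h2_angular(angle)


v_horizontal = visibility([10.0, 0.0])
v_diagonal = visibility([10.0 / np.sqrt(2), 10.0 / np.sqrt(2)])
v_high = visibility([60.0, 0.0])
```

Under these assumptions an impulse of the same magnitude is less visible on the diagonal than on an axis, and visibility drops sharply at high frequencies, which is the behavior the synthesis exploits.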

Recall that for the raised cosinusoidal profile we have only the first order
moire. In moire pattern synthesis, we receive the desired pattern $p(g(x, y)) =
p(px + qy)$ as input. The gradient of the desired pattern $\nabla g = (p, q)$ points in
the required frequency direction. In addition, $1/\|\nabla g\|$ is the distance between two
adjacent periods of $p(g(x, y))$. Therefore the magnitude of the spatial frequency
of $p(g(x, y))$ is $\|\nabla g\|$. In other words, for $p(g(x, y))$ to appear as the $(1, -1)$
moire, $f_{(1,-1)}$ should be equal to $\nabla g$. Since in moire synthesis we receive $g$ as
input, the location of $f_{(1,-1)}$ is set by $\nabla g$ when (5) is satisfied.

The freedom in choosing different $\psi$ and $\phi$ that satisfy (5) allows controlling
the location of $f_{(1,1)}$. For the $(1, -1)$ moire to be visible, we should minimize
the visibility of the $(1, 1)$ moire. The optimal $\phi_{opt}, \psi_{opt}$ for this minimization is

$\phi_{opt} = \arg\min_{\phi} H_1(\|f_{(1,1)}\|) \cdot H_2(\mathrm{angle}(f_{(1,1)}))$ (7)

$\psi_{opt} = g + \phi_{opt}$ (8)

$f_{(1,1)}$ can be computed by $f_{(1,1)} = f_\phi + f_\psi = \nabla\phi + \nabla\psi = 2\nabla\phi + \nabla g$ where we
used the following result: $g = \psi - \phi \Rightarrow \nabla\psi = \nabla g + \nabla\phi$.
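The relation between the gradients and the two first order moire frequencies can be sketched for linear gratings as follows. The function name and the example numbers are assumptions chosen for illustration.

```python
import numpy as np


def moire_frequencies(grad_g, grad_phi):
    """First order moire frequencies for linear gratings with g = psi - phi.

    Returns (grad_psi, f_sub, f_add), where f_sub is the (1, -1) frequency
    and f_add is the (1, 1) frequency.
    """
    grad_g = np.asarray(grad_g, dtype=float)
    grad_phi = np.asarray(grad_phi, dtype=float)
    grad_psi = grad_g + grad_phi          # from g = psi - phi
    f_sub = grad_psi - grad_phi           # always equals grad_g
    f_add = grad_psi + grad_phi           # equals 2 * grad_phi + grad_g
    return grad_psi, f_sub, f_add


grad_g = (0.10, 0.05)
grad_psi, f_sub, f_add = moire_frequencies(grad_g, grad_phi=(0.40, -0.20))
_, _, f_add_steeper = moire_frequencies(grad_g, grad_phi=(0.80, -0.40))
```

The subtractive frequency stays fixed at $\nabla g$ whatever $\nabla\phi$ is chosen, while the additive frequency moves away from the origin as $\|\nabla\phi\|$ grows.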

As $\|\nabla\phi\|$ is increased, the frequency vector that corresponds to the $(1, 1)$
moire becomes larger in magnitude. According to the visibility function (6) this
means that the visual system will be less responsive to this frequency, hence the
performance will improve as $\|\nabla\phi\|$ is increased.

Such uncontrolled improvement in the performance becomes problematic
when we use digital media to represent the images. As we further increase $\|\nabla\phi\|$
we will get an additional strong unwanted moire between the grating and the
pixel frequency of the display due to aliasing.

To account for this effect in the performance criteria, another term, $M$, will
be added to (6) as follows

$V(f_{(1,1)}) = H_1(\|f_{(1,1)}\|) \cdot H_2(\mathrm{angle}(f_{(1,1)})) + M(\|f_{(1,1)}\|)$ (9)

This term will become dominant for very high frequencies and prevent the
unbounded decrease in (9). This digitization term $M(\|f_{(1,1)}\|)$ should be negligible
for low frequencies and dominant for high frequencies. In addition to choosing
$M$ with these properties, we should choose the crossing point

$M(\|f_{(1,1)}\|) = H_1(\|f_{(1,1)}\|) \cdot \min H_2(\mathrm{angle}(f_{(1,1)}))$ (10)

with care. To do so the function $M(\cdot)$ was chosen to be of the form $M(\cdot) = m \tilde{M}(\cdot)$
where $\tilde{M}(\cdot)$ is an increasing polynomial and $m$ is a parameter. $m$ is determined in
order to set the crossing point (10) as described below. We define the digitization
threshold $T_f$ as the frequency at which two periods of the gratings $\psi$ and $\phi$ start
to merge on the display. We denote by $\phi$ and $\psi$ the functions computed by
equations (7) and (8). We would like $\|\nabla\phi\|$ and $\|\nabla\psi\|$ to be smaller than the
digitization threshold by $\epsilon_1 > 0$: $\|\nabla\phi\| < T_f - \epsilon_1$, $\|\nabla\psi\| < T_f - \epsilon_1$. Since the
choice of $M$ affects the choice of $\phi$ and only then $\psi$ is computed, we will explore
the relation between $\|\nabla\phi\|$ and $\|\nabla\psi\|$: $g = \psi - \phi$, $\nabla\psi = \nabla g + \nabla\phi$, $\|\nabla\psi\| =
\|\nabla g + \nabla\phi\| \le \|\nabla g\| + \|\nabla\phi\|$. If the condition $\|\nabla\phi\| < T_f - \epsilon_1$ holds, we have
$\|\nabla\psi\| < T_f - \epsilon_1 + \|\nabla g\|$.

We denote the frequency of the crossing point as $f_{cp}$. If we assume that
$\|\nabla\phi\| \le f_{cp} + \epsilon_2$, $\epsilon_2 > 0$, we arrive at the following result: If we choose the
crossing point frequency $f_{cp}$ according to $f_{cp} < T_f - \|\nabla g\| - \epsilon_1 - \epsilon_2$, $\phi$ and $\psi$
will satisfy $\|\nabla\phi\| < T_f - \epsilon_1 - \|\nabla g\|$, $\|\nabla\psi\| < T_f - \epsilon_1$. Intuitively, the value
of $\epsilon_1$ represents how much we would like to stay away from the digitization
threshold and $\epsilon_2$ represents the possibility that the minimization procedure will
carry $\|\nabla\phi\|$ beyond the crossing point.
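The bound on the crossing point frequency can be worked through numerically. The sketch below is illustrative only; the function names are assumptions, and the numbers (a digitization threshold of 0.5 in hypothetical units of cycles per pixel) are made up for the example.

```python
def max_crossing_frequency(t_f, grad_g_norm, eps1, eps2):
    """Upper bound on f_cp: f_cp must stay below T_f - ||grad g|| - eps1 - eps2."""
    return t_f - grad_g_norm - eps1 - eps2


def gradient_bounds(f_cp, grad_g_norm, eps2):
    """Bounds implied by ||grad phi|| <= f_cp + eps2 and the triangle inequality.

    Returns (bound on ||grad phi||, bound on ||grad psi||).
    """
    phi_bound = f_cp + eps2
    psi_bound = grad_g_norm + phi_bound   # ||grad psi|| <= ||grad g|| + ||grad phi||
    return phi_bound, psi_bound


# Hypothetical values: T_f = 0.5, ||grad g|| = 0.05, eps1 = 0.02, eps2 = 0.03.
t_f, grad_g_norm, eps1, eps2 = 0.5, 0.05, 0.02, 0.03
f_cp_limit = max_crossing_frequency(t_f, grad_g_norm, eps1, eps2)

f_cp = 0.39                                # any value below the limit
phi_bound, psi_bound = gradient_bounds(f_cp, grad_g_norm, eps2)
```

With these numbers the limit on $f_{cp}$ is about 0.4, and choosing $f_{cp} = 0.39$ keeps both gradients at least $\epsilon_1$ below the digitization threshold.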

4.2 Performance of Non-linear Moire Patterns 

The visibility of the $(1, 1)$ moire in a general superposition over a region $\Omega$ is
defined as $W(f_{(1,1)})(\Omega) = \iint_\Omega V(f_{(1,1)}(x, y))\,dx\,dy$ where $V(f_{(1,1)}(x, y))$ is the
function defined in (9). Over a discrete image $I$ of size $M \times N$ we have

$W(f_{(1,1)})(I) = \sum_{i=1}^{M} \sum_{j=1}^{N} V(f_{(1,1)}(i, j))$ (11)

5 The Integrability Constraints

When minimizing the visibility of the $(1, 1)$ moire, equation (11) depends on
$f_{(1,1)}(x, y) = 2\nabla\phi(x, y) + \nabla g(x, y)$. Since (11) depends explicitly on $\nabla\phi$, we
may state the optimization problem as

for each $(i, j)$ find $\nabla\phi_{opt}(i, j)$ that will minimize $V(f_{(1,1)}(i, j))$.

The problem with such a scheme is that the obtained vector field $\nabla\phi_{opt}$ may
not be a conservative field. This means that no function can be found whose
gradient is $\nabla\phi_{opt}$.

Enforcing the integrability test for $\nabla\phi$, we are led to the following problem:

$\nabla\phi_{opt} = \arg\min_{\nabla\phi} W(f_{(1,1)})(I)$ (12)

subject to $\phi_{xy} = \phi_{yx}$.

As we shall see in the next section, our variational solution will not impose
integrability as a hard constraint, but we shall enforce it approximately via a
penalty term.
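The integrability condition can be checked numerically on a sampled vector field. The following is a minimal Python sketch, assuming the field $(p, q)$ is stored as arrays indexed `[i, j]` with `i` running along x and `j` along y; the function name and the finite difference scheme (NumPy's `gradient`) are choices made for illustration.

```python
import numpy as np


def integrability_residual(p, q, spacing=1.0):
    """Return p_y - q_x, which vanishes for a conservative field (p, q).

    Arrays are indexed [i, j] with i along x (axis 0) and j along y (axis 1).
    """
    p_y = np.gradient(p, spacing, axis=1)
    q_x = np.gradient(q, spacing, axis=0)
    return p_y - q_x


h = 0.05
x, y = np.meshgrid(np.linspace(-1, 1, 41), np.linspace(-1, 1, 41), indexing="ij")

# Gradient of phi = x * y: p = y, q = x, so the field is integrable.
residual_gradient = integrability_residual(y, x, spacing=h)

# Rotational field p = -y, q = x is not the gradient of any function.
residual_rotation = integrability_residual(-y, x, spacing=h)
```

The first residual is zero everywhere, while the second is a constant $-2$, so a pointwise minimizer that produced the second field could not be integrated into a grating function.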



6 Results from Variational Calculus 



In this section we will state some results from variational calculus that are used 
in the following sections. For the cases below see also [12] and for a more complete 
description with derivations refer for example to [13] or [7]. 

The calculus of variations deals with minimizing functionals. A functional is 
a function from a set of functions to the real line. A fundamental result of the 
calculus of variations is that the extrema of functionals must satisfy an associated 
differential equation called the Euler equation over the domain. 

The Euler equation is a necessary but not sufficient condition for the existence 
of an extremum. By extrema we mean local minima, maxima and inflection 
points. We assume that all functions and functionals are continuous and have 
derivatives. Another assumption is that the functional values are positive. 

For example a functional $I_1$ which depends on a bivariate function $z(x, y)$
as follows, $I_1[z] = \iint_\Omega F(x, y, z, z_x, z_y)\,dx\,dy$, yields the following Euler equation

$F_z - \frac{\partial}{\partial x} F_{z_x} - \frac{\partial}{\partial y} F_{z_y} = 0$

The functional $I_2$ which depends on the gradient of $z(x, y)$, i.e. $\nabla z(x, y) =
(p(x, y), q(x, y))$, as follows, $I_2[p, q] = \iint_\Omega F(x, y, p, q, p_x, p_y, q_x, q_y)\,dx\,dy$, yields a
coupled set of differential Euler equations $F_p - \frac{\partial}{\partial x} F_{p_x} - \frac{\partial}{\partial y} F_{p_y} = 0$,
$F_q - \frac{\partial}{\partial x} F_{q_x} - \frac{\partial}{\partial y} F_{q_y} = 0$. The Euler differential equations require boundary conditions to have
a specified solution. However, in many problems there are no imposed prior conditions
on the boundary values, or the behavior of the function at the boundary
may be restricted by some general conditions. In such cases, the variational calculus
supplies us with further conditions for the boundary values. These conditions
are also necessary conditions for the functional to be stationary with respect to
variations (see [7] page 208). Such conditions are called natural boundary conditions.
In the case of $I_1$, the natural boundary condition is $(F_{z_x}, F_{z_y}) \cdot n = 0$ where
$n$ is the normal to the parametric curve representing the boundary of $\Omega$. For $I_2$,
the natural boundary conditions are $(F_{p_x}, F_{p_y}) \cdot n = 0$, $(F_{q_x}, F_{q_y}) \cdot n = 0$.

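Recall that the visibility of the $(1, 1)$ moire in a small area surrounding $(x, y)$
is expressed by $V(p(x, y), q(x, y))$ where $(p(x, y), q(x, y)) = \nabla\phi$. Adding a penalty
term which represents the integrability constraint and squaring $V$ results in the
following functional

$I_3 = \iint_\Omega \left( V^2(p, q) + \lambda (p_y - q_x)^2 \right) dx\,dy.$ (13)

Equation (13) is in $I_2$ form and its Euler equations are

$V V_p - \lambda (p_{yy} - q_{xy}) = 0, \quad V V_q - \lambda (q_{xx} - p_{yx}) = 0.$ (14)

By discretizing (14) the following iterative scheme is obtained

$p^{k+1}_{i,j} = \bar{p}^k_{i,j} - \frac{1}{2} \hat{q}^k_{i,j} - \frac{1}{2\lambda} V(p^k_{i,j}, q^k_{i,j})\, V_p(p^k_{i,j}, q^k_{i,j})$

$q^{k+1}_{i,j} = \bar{q}^k_{i,j} - \frac{1}{2} \hat{p}^k_{i,j} - \frac{1}{2\lambda} V(p^k_{i,j}, q^k_{i,j})\, V_q(p^k_{i,j}, q^k_{i,j})$

where

$\bar{p}_{i,j} = \frac{p_{i,j+1} + p_{i,j-1}}{2}, \quad \bar{q}_{i,j} = \frac{q_{i+1,j} + q_{i-1,j}}{2}$

$\hat{p}_{i,j} = \frac{p_{i+1,j+1} + p_{i-1,j-1} - p_{i+1,j-1} - p_{i-1,j+1}}{4}$

$\hat{q}_{i,j} = \frac{q_{i+1,j+1} + q_{i-1,j-1} - q_{i+1,j-1} - q_{i-1,j+1}}{4}$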

The choice of including the integrability constraints as a penalty term works
better than other approaches that try to strictly enforce the integrability constraints
[12]. The parameter $\lambda$ enables control of the trade-off between a
"smoother" vector field that will enable better recovery of $\phi$ and a vector field
which reaches lower visibility.

As initial conditions, we took an arbitrary vector field. The boundary conditions
are described in section 8. Note that the natural boundary conditions in
this case reduce to the integrability condition $p_y = q_x$ on the boundary.
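One sweep of a penalty-based relaxation of this kind can be sketched in Python. The sketch assumes the update obtained by discretizing $V V_p - \lambda(p_{yy} - q_{xy}) = 0$ and $V V_q - \lambda(q_{xx} - p_{yx}) = 0$ with unit grid spacing, namely a neighbor average minus half of a mixed difference minus $V V_p / (2\lambda)$. The visibility function and its partial derivatives are passed in as callables, periodic wrap-around is used at the borders for brevity, and all names are hypothetical.

```python
import numpy as np


def shift(a, di, dj):
    """Return the array whose [i, j] entry is a[i + di, j + dj] (periodic)."""
    return np.roll(np.roll(a, -di, axis=0), -dj, axis=1)


def mixed_difference(a):
    """Central estimate of the mixed derivative a_xy on a unit grid."""
    return 0.25 * (shift(a, 1, 1) + shift(a, -1, -1)
                   - shift(a, 1, -1) - shift(a, -1, 1))


def iterate_once(p, q, lam, v, v_p, v_q):
    """One Jacobi-style sweep; arrays are indexed [i, j] with i ~ x, j ~ y."""
    p_bar = 0.5 * (shift(p, 0, 1) + shift(p, 0, -1))   # y-neighbors of p
    q_bar = 0.5 * (shift(q, 1, 0) + shift(q, -1, 0))   # x-neighbors of q
    vis = v(p, q)
    p_new = p_bar - 0.5 * mixed_difference(q) - vis * v_p(p, q) / (2.0 * lam)
    q_new = q_bar - 0.5 * mixed_difference(p) - vis * v_q(p, q) / (2.0 * lam)
    return p_new, q_new


def zero(p, q):
    """Placeholder visibility term, used to isolate the penalty part."""
    return np.zeros_like(p)


n = 16
j = np.arange(n)
p0 = np.tile(np.sin(2 * np.pi * j / n), (n, 1))   # p varies along y: p_y != 0
q0 = np.zeros((n, n))
p1, q1 = iterate_once(p0, q0, lam=200.0, v=zero, v_p=zero, v_q=zero)
```

With the visibility term switched off, the sweep only smooths away the non-integrable part of the field: the example field, whose $p_y - q_x$ is nonzero, shrinks in amplitude after one iteration.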



7 Recovering Height from Gradient 

7.1 Height from Gradient Problem 

The height from gradient problem deals with the following problem: Given a vector
field $F(x, y)$, find a function $\phi(x, y)$ such that $\nabla\phi(x, y) = F(x, y)$.

Note that the solution $\phi$ is not unique since adding a constant term to $\phi$
will result in another solution to the problem. This problem can therefore be
classified as an initial value problem: Given an initial value $\phi(x_0, y_0)$ at some location
and $F(x, y)$, find $\phi(x, y)$ for all the region.

A simple solution to this problem is

$\phi(x, y) = \phi(x_0, y_0) + \int_C F \cdot dL$ (15)

where $C$ is a curve from $(x_0, y_0)$ to $(x, y)$. This method allows us to compute
$\phi$ completely once an initial value $\phi(x_0, y_0)$ is determined. The problem with
equation (15) is that it is numerically unstable. A height value at some point
would, in the presence of noise, depend on the integration path that was taken.
It is better to find a best fit surface $\phi^*$ to $\phi$. This can be accomplished by a
variational calculus setting [12]. The variational approach to height from gradient
is discussed in the next subsection.



7.2 Variational Calculus Setting for Height from Gradient 

Given the vector field $F(x, y) = (p(x, y), q(x, y))$ and a possible approximate
solution $\phi^*$ we wish to minimize the following functional

$\iint_\Omega \left( (\phi^*_x - p)^2 + (\phi^*_y - q)^2 \right) dx\,dy$ (16)

Calculating the Euler equation for (16) yields $\Delta\phi^* = p_x + q_y$ where $\Delta\phi^*$ is
the Laplacian of $\phi^*$: $\Delta\phi^* = \phi^*_{xx} + \phi^*_{yy}$. This equation is a second order elliptic PDE
called the Poisson equation. The Poisson equation is widely studied and many procedures
for numerical solutions exist. In our experiments we used two methods
to solve this equation. One is a multigrid method and the other is based on sine
transforms and tridiagonal solutions [14].
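As an illustration of solving $\Delta\phi^* = p_x + q_y$, the following Python sketch uses a plain FFT-based solver. This is a substitute technique chosen for brevity, not the multigrid or sine transform solvers used in the experiments, and it assumes periodic boundary conditions and unit grid spacing. The function name is hypothetical, and the free additive constant is fixed by forcing the mean of the result to zero.

```python
import numpy as np


def height_from_gradient_fft(p, q):
    """Least-squares surface whose gradient best matches (p, q).

    Arrays are indexed [i, j] with i along x and j along y. Periodic
    boundaries are assumed; the mean of the returned surface is zero.
    """
    n_x, n_y = p.shape
    wx = 2 * np.pi * np.fft.fftfreq(n_x)[:, None]   # angular frequency along x
    wy = 2 * np.pi * np.fft.fftfreq(n_y)[None, :]   # angular frequency along y
    denom = wx ** 2 + wy ** 2
    denom[0, 0] = 1.0                               # avoid dividing by zero at DC
    p_hat = np.fft.fft2(p)
    q_hat = np.fft.fft2(q)
    # Poisson equation in the Fourier domain:
    # -(wx^2 + wy^2) * phi_hat = i*wx*p_hat + i*wy*q_hat
    phi_hat = -1j * (wx * p_hat + wy * q_hat) / denom
    phi_hat[0, 0] = 0.0                             # fix the arbitrary constant
    return np.real(np.fft.ifft2(phi_hat))


n = 32
i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
phi_true = np.sin(2 * np.pi * i / n) * np.cos(4 * np.pi * j / n)
p = (2 * np.pi / n) * np.cos(2 * np.pi * i / n) * np.cos(4 * np.pi * j / n)
q = -(4 * np.pi / n) * np.sin(2 * np.pi * i / n) * np.sin(4 * np.pi * j / n)
phi_rec = height_from_gradient_fft(p, q)
```

For an exactly integrable, periodic field such as this one the surface is recovered up to the additive constant; for a non-integrable field the same call returns the best fit in the least-squares sense of (16).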

Once again, note that this equation does not uniquely specify a solution
without further constraints. In fact, we can add any function $h$ that satisfies
$\Delta h = 0$ to the solution. For this particular problem, the natural boundary
conditions are $(\phi^*_x, \phi^*_y) \cdot n = (p, q) \cdot n$ where $n$ is the normal to the boundary,
$n = \left(\frac{dy}{ds}, -\frac{dx}{ds}\right)$. With these boundary conditions, the solution is still not unique,
since an arbitrary constant can be added to $\phi^*$ without changing the functional.
To get a unique solution, one can fix an arbitrary height at some point.

8 Experimental Results and Conclusions 

When solving (14) we have to specify a boundary condition and an initial value.
The initial value is $(p^0_{i,j}, q^0_{i,j})$ for all $i, j$ in the domain. The boundary condition is
the update rule from one iteration to the next along the boundary of the domain.
Since we have no a priori knowledge of the boundary values, we will consider two
methods for updating the boundary values between iterations.

If we have additional knowledge of the desired $p$ and $q$ we may be able to
incorporate this knowledge into the boundary condition. For example, if the
desired pattern can be rolled up along the x and y axes to form the surface of a
three dimensional torus or donut, periodic boundary conditions can be used.

In periodic boundary conditions, the boundary value in the next iteration is
taken from the computed values along the opposite boundary. For the case of
an image $I$ of size $N \times N$, the update rule is $p^{k+1}_{i,1} = p^k_{i,N-1}$, $p^{k+1}_{i,N} = p^k_{i,2}$,
$p^{k+1}_{1,i} = p^k_{N-1,i}$, $p^{k+1}_{N,i} = p^k_{2,i}$. The boundary of $q$ is updated in a similar manner.
Periodic boundary conditions will perform well for periodic shapes, but this is
hardly the general case. In the general case, if we knew $p_y, q_x$ on the boundary
we could update the boundary values by integration: $p(1, j) = p(1, j - 1) +
\int_{j-1}^{j} p_y(1, t)\,dt$, $p(N, j) = p(N, j - 1) + \int_{j-1}^{j} p_y(N, t)\,dt$, $p(i, 1) = p(i, 2) -
\int_{1}^{2} p_y(i, t)\,dt$, $p(i, N) = p(i, N - 1) + \int_{N-1}^{N} p_y(i, t)\,dt$. The values of $q$ can be
similarly updated. The integration can be numerically approximated by the
trapezoidal rule.

We now turn to the problem of approximating the derivatives in the update
equations. We can approximate $q_x(i, 1), q_x(i, N)$, $i = 2 \ldots N - 1$, by the central
difference formula. The values of $q_x(1, j), q_x(N, j)$ can be approximated by forward
or backward formulas. We can now use the fact that $p$ and $q$ satisfy $p_y = q_x$ on
the boundary to compute $p_y$ needed for the above computation.

To evaluate the algorithm it would be desirable to synthesize moire patterns
whose optimal $\phi$ and $\psi$ are known. We could then compare the optimal $\phi$ and $\psi$
with the functions found by the minimization process.

Finding optimal $\phi$ and $\psi$ for arbitrary moire patterns requires exhaustive
search over a function space. Such a search is, in the general case, clearly impractical.
However, for certain moire patterns, optimal synthesis can be computed
without exhaustive search. An example of such moire patterns is linear patterns.
From the structure of the optimality criterion it is clear that the optimal $\nabla\phi$
should be constant throughout the image. The optimal $\phi$ should therefore be a
linear image.

To find the optimal $\phi$ we proceed as follows. Every linear function $\phi$ is characterized
by two parameters $(p, q) = \nabla\phi$. Finding the optimal $\phi$ reduces in this
case to evaluating the performance criteria over $\mathbb{R}^2$. The optimal $\phi$ is not unique
since the performance criteria are symmetric with respect to reflection around the
lines $x = 0$, $y = 0$, $y = x$, $y = -x$. In other words, $V(p_0, q_0) = V(p_0, -q_0) =
V(-p_0, q_0)$ and so on. We then check the solution found by the iterative scheme.
Starting from an initial condition of $(p, q) = (0, 1)$ and using the boundary conditions
we arrive at the solution whose gradient is identical to one of the optimal
gradients. The initial values are shown in Fig. 5 and the values at iteration 20
are shown in Fig. 6. The dashed vector represents the gradient of the original
linear image and the solid vectors represent the computed $\nabla\phi$.

It is interesting to start with two functions, create a superposition and feed
this superposition to the iterative procedure. In general, the solution will not be
the same as the two original functions. The reason for this is probably that the
superposition we started with was not optimal or that the algorithm converged
to a different local minimum.

Fig. 5. Linear moire, initial condition

Fig. 6. Linear moire, iteration 20

However, we succeeded in calculating the two original functions in the following
case. We start with two ellipses whose centers are shifted along the x axis
to $x = -s$ and $x = s$.
The equations for two such ellipses are $\frac{(x+s)^2}{a^2} + \frac{y^2}{b^2} = h^2$, $\frac{(x-s)^2}{a^2} + \frac{y^2}{b^2} = k^2$.
The indicial equation is $h - k = p$. By elimination of $h$ and $k$ from these
equations and after some rearrangements the following equation is obtained [4]:
$\frac{x^2}{a^2 p^2/4} - \frac{y^2}{b^2 (4 s^2 - a^2 p^2)/(4 a^2)} = 1$, which represents a hyperbola parameterized by $p$. Indeed,
when we used the synthesis algorithm to produce hyperbolic moire patterns, such
ellipses were found. In Fig. 7 results are shown for natural boundary conditions.
The iterative process converged fast in our experiments. Usually after about 200
iterations, there was no apparent change in the images. The parameter $\lambda$ allows
controlling the "smoothness" of the solution. The results in Figure 8 were obtained
for $\lambda = 200$. Compare these images with Figure 9, which was obtained for
$\lambda = 1000$.
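The claim that shifted ellipses produce hyperbolic subtractive moires can be checked numerically. The sketch below assumes a particular parameterization that is not fixed by the text: ellipses with semi-axis scales $a$ and $b$ centered at $x = -s$ and $x = s$, indexed by $h$ and $k$, and a hyperbola with squared semi-axes $a^2 p^2 / 4$ and $b^2 (4 s^2 - a^2 p^2) / (4 a^2)$. All function names and numbers are hypothetical.

```python
import math


def ellipse_indices(x, y, a, b, s):
    """Indices (h, k) of the two shifted ellipses passing through (x, y)."""
    h = math.sqrt((x + s) ** 2 / a ** 2 + y ** 2 / b ** 2)
    k = math.sqrt((x - s) ** 2 / a ** 2 + y ** 2 / b ** 2)
    return h, k


def hyperbola_y(x, a, b, s, p):
    """Positive y on the assumed hyperbola for a given x (right branch)."""
    a2 = a * a * p * p / 4.0
    b2 = b * b * (4.0 * s * s - a * a * p * p) / (4.0 * a * a)
    return math.sqrt(b2 * (x * x / a2 - 1.0))


a, b, s, p = 2.0, 1.0, 1.5, 0.5
differences = []
for x in (0.6, 1.0, 2.0):
    y = hyperbola_y(x, a, b, s, p)
    h, k = ellipse_indices(x, y, a, b, s)
    differences.append(h - k)   # should equal p along the whole branch
```

Every sampled point on the right branch gives the same index difference $h - k = p$, which is what the indicial equation requires of a subtractive moire curve; the left branch corresponds to $-p$.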

A periodic profile of a face image is shown in Figure 10. The computed $\phi$
and $\psi$ and the superposition image are shown in Figures 11 and 12.

The suggested visibility criterion seems to produce good results, especially
in simple cases such as Fig. 7. In more complicated images (such as the face
images) the optimization algorithm seems to converge to a local minimum and
the final result depends on the initial conditions. Although in the general case
the boundary condition is unknown, experimental results show that this affects
only solution pixels near the boundary.

The results of this work suggest another application area for moire synthesis.
If the desired pattern is smooth, the two original images bear little or no resemblance
to the desired pattern. The desired pattern is created by the nonlinear
superposition from both images. Moire pattern synthesis may then be used for
some sort of visual cryptography. Instead of transmitting the image on an unsecured
channel, it is possible to transmit two images which create a moire pattern
of the desired pattern. However, note that for non-smooth images such as the
face image, areas of discontinuities may disclose the boundary of the face in $\psi$
and $\phi$.

A method for visual cryptography for binary images has been proposed in
[15] that allows perfect reconstruction, but the reconstructed image is half the
resolution of the transmitted images. Extending this method to gray-level images
will require transmitting two very large images. In moire synthesis, perfect
reconstruction is not possible. However, as is seen in the previous examples, it
is often easy to recognize the pattern from the superposition. Our performance
criteria were designed for visibility of the moire patterns. Other applications,
such as true visual cryptography, certainly require different performance criteria.
The optimization scheme, however, may remain the same.

Fig. 7. Hyperbolic patterns, results for natural boundary conditions

Fig. 8. Result for $\lambda = 200$

Fig. 9. Result for $\lambda = 1000$

Fig. 10. Periodic profile of a face image

Fig. 11. $\phi$ and $\psi$ computed for Fig. 10

Fig. 12. The superposition of the images in Fig. 11

[Figure panels: the first grating; the second grating; the superposed gratings.]

In our scheme for moire synthesis, the superposition image consists of low-frequency
and high-frequency components. The low-frequency components represent the desired
pattern and zero-order moire patterns. The high-frequency components represent the
(1,1) moire terms, which were “pushed” outside the visibility circle, and higher-order
moires. We can therefore apply low-pass filtering to the superposition image to
enhance the desired pattern. If, in addition, the periodic profile does not have a
DC term, zero-order moires do not exist, and hence we expect better reconstruction.


Abstract. We have developed a new stochastic image rendering method for the
compression, description and segmentation of images. This paintbrush-like image
transformation is based on a random search that inserts brush-strokes into a
generated image at decreasing brush-sizes, without predefined models or
interaction. We introduced a sequential multiscale image decomposition method
based on simulated rectangular paintbrush strokes. The resulting images look like
good-quality paintings with well-defined contours, at an acceptable distortion
compared to the original image. The image can be described by the parameters of
the consecutive paintbrush strokes, resulting in a parameter series that can be
used for compression. The painting process can be applied to image representation,
segmentation and contour detection. Our original method is based on a stochastic
exhaustive search, which takes a long time to converge. In this paper we propose a
modified algorithm with a speed-up of about 2x, where the faster convergence is
supported by a dynamic Metropolis-Hastings rule.



1 Introduction 

Images can be interpreted in several ways by decomposition into basic functions:
strokes [4, 5], fractals [1], etc. Each of these is natural in some sense: strokes are good
representations of letters or shapes, Gabor functions [8] are natural for human
perception, and fractals originate from self-similarity. When we look at images or
image scenes, we usually search for familiar features. This is exploited in the fine
arts: small details are sometimes neglected while the main features are enhanced.

Nonlinear partial differential equations [6] can be used for enhancing the main
image structure. When compressing an image, anisotropic diffusion, based on the
scale-space paradigm, can enhance the basic image features to give visually better
quality [10].

Good structured-diffusion effects can be achieved by the Gibbs reaction-diffusion
method of [14]. A set of about a dozen filters is applied to build a potential function
that is minimized through a stochastic process. Surprisingly good restoring and hiding
effects can be demonstrated. This method needs some prior model, and it results in a
partly isotropic morphology.

In our previous paper [11] we introduced a method that follows a painter’s process by
using simplified artificial strokes. It is not an image filter that produces a
painting-like isotropic transformation. Other image-painting methods in computer
graphics are model- or edge-controlled [e.g. 3, 7]. These methods can produce
interesting visual effects (e.g. impressionist style, brush-splines), but they differ
from the work of a ‘real-scene’ painter (e.g. old-fashioned portrait painters or S. Dali).
Our first goal is to simulate the painting process to get a picture similar to a
‘real-scene’ painting, where the purpose of the painter is to portray something that
looks like real scenery. We deal with a method that copies the real visual world into a
pleasant form rather than with the artistic interpretation of some special style. Small
objects are elaborated with fine brushes, while plain surfaces are painted with larger
strokes. In addition, the sequential parameters of the consecutive strokes can be
used for image description or moderate compression. Our method enhances the main
features, which can easily be followed by eye: the strokes guide our sight.

However, this stochastic stroke searching and rendering is an exhaustive process.
Any new trial (a new stroke with the color in question) is accepted if it reduces
the distortion. The position of a stroke is a random proposal (to avoid ramble-generated
“structures”), while the accept/reject decision on the proposed stroke depends on
the current stage only; that is, the process is a Markov chain.

By selecting an appropriate Markov chain Monte Carlo (MCMC) process, the exhaustive
search can be replaced with a random decision based on a density approximation of the
distribution of the distortion error of strokes. The difficulty lies in the target
density: it must be defined dynamically to avoid either returning to the exhaustive
search or falling into the first local minimum.



2 The Basic Concepts of the Paintbrush Algorithm [11] 

Here we would like to take into consideration human behavior and perception when
looking at and reproducing the visual world. When looking at an image, there is a
very common way of understanding the visual information:

• Looking at the main outline of the image, 

• Looking for the objects, 

• Finding small details, 

• Relaxed scanning (wandering) on the image. 

When interpreting the image content in a visual form, such as representational
painting, the re-creation of the image can proceed in a similar way:

• Outlining the main areas,

• Elaborating objects, 

• Refining the small details. 

In both cases (understanding and reproducing) there is a well-defined scale-space
line: first the large details, then the finer details are processed. In this respect
representational painting is similar to anisotropic diffusion, which is based on the
PDE-based scale-space theory [6]. However, there is a very important difference
between the two scale-space approaches:

Anisotropic diffusion enhances the main edges but smoothes the others, while the
present paintbrush transformation has sharp edges at every stage. We get sharp details
even in small or low-contrast areas. In this sense representational painting gives
better segmentation than anisotropic diffusion. The effect is similar to that of
Markov Random Field (MRF) segmentation [2]. However, MRF has the drawback that the
number of possible “colors” must be limited to achieve good segmentation in a finite
time.

The other important viewpoint is the question of the finest details of an image.
When the image contains too many fine details (e.g. a sharp photo), our sight may be
disturbed by the unnecessary small details. If the image has previously been compressed
with a function set such as cosine or wavelet functions, annoying artifacts appear
in the range of fine details. We can remove the annoying fine information content
at the cost of smoothing the sharp segment contours. In contrast, we see
a practical effect when we look at a painting from a distance:

• In the case of an artwork painting, the visual effect can be perfect, giving high
contrast to represent the objects;

• Getting closer to the painting, we can see sharp edges (paintbrush strokes), but
there is a point beyond which there are no more fine details;

• We can relax when looking at the image, since there are no details that
quiver when scanning through it.

When defining what we expect from a new process that follows the main features
of representational painting, we can state the main concepts of the new algorithm:

1. It should have sharp edges at any level of image construction;

2. There are no fine details below a limit;

3. There are sharp edges at the finest level as well;

4. From a given distance the image must give the same visual scenery as the
original.

In the following we show that the above constraints can be fulfilled by a
method that follows the generation of a painting by using paintbrush strokes of
different sizes.



3 Our Previous Algorithm 

The main steps of the original algorithm of [11] for generating paintbrush strokes are as follows:

1) Start with the coarsest brush-size;

2) or select the next, smaller brush-size ($\delta$);

• If the finest scale has been completed, then Goto 14;

3) $C_{\varphi,\delta}$: Convolve the input image $I$ with brush distributions depending on
orientation $\varphi$ and brush-size $\delta$. At brush-size $\delta$ we have at least 8
$C_{\varphi,\delta}$ maps due to the $\varphi$ orientations. This map series is necessary to
estimate the brush color anywhere in the image without further brush-stroke tuning;

4) $D$: Difference image between the original $I$ and the present iterated stage $X$;

5) $A$: Absolute or squared values of the difference image $D$, giving a distortion map;

6) If the error summation over $A$ converges too slowly,

• then Goto 2;

7) $E$: Convolve the error image $A$ with a smoothing kernel matched to the diameter of
the brush-size $\delta$;

8) Calculate the histogram of the distortion map $E$, defining a threshold value above
which the probability of higher errors is $\varepsilon$;

9) Randomly choose a position $(x,y)$ in the image and a brush orientation $\varphi$;

10) If the distortion value of $E$ at the pixel position $(x,y)$ is in the upper region of the
distortion histogram with probability $\varepsilon$, then

• give the color $C_{\varphi,\delta}(x,y)$ to the brush-stroke centered at position $(x,y)$ with
the actual orientation $\varphi$ and brush-size $\delta$;

• or, in the case of large strokes, give the majority-vote color of the brush area
of the original image $I$ to the brush-stroke centered at position $(x,y)$;

11) Cover the painted image with the pattern of this brush-stroke;

12) If the error summation between the original and the present stage image over the
area of this stroke decreases,

• then accept the new stroke;

• else reject it and restore the previous stage in the stroke area;

13) If the counter of brush-strokes is smaller than a limit number,

• then Goto 9;

• else Goto 4;

14) Ready.
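
The accept/reject core of the steps above (9, 11, 12 and 13) can be sketched in a few lines. This is only a minimal illustration under simplifying assumptions that are not part of [11]: strokes are axis-aligned rectangles (no orientation), the stroke color is the mean of the original image under the stroke, the error-map threshold of steps 7, 8 and 10 is omitted, and the names `stroke_pixels`, `paint` and `total_error` are hypothetical.

```python
import random

def stroke_pixels(cx, cy, half_w, half_h, width, height):
    """Pixels (x, y) of an axis-aligned rectangular stroke, clipped to the image."""
    return [(x, y)
            for y in range(max(0, cy - half_h), min(height, cy + half_h + 1))
            for x in range(max(0, cx - half_w), min(width, cx + half_w + 1))]

def paint(original, brush_sizes, trials_per_size, seed=0):
    """Greedy stochastic rendering of a gray-level image given as a list of rows.

    brush_sizes is a list of (half_width, half_height) pairs, coarse to fine.
    """
    rng = random.Random(seed)
    height, width = len(original), len(original[0])
    canvas = [[0.0] * width for _ in range(height)]
    accepted = []
    for half_w, half_h in brush_sizes:
        for _ in range(trials_per_size):
            # Step 9 (simplified): random position, no orientation.
            cx, cy = rng.randrange(width), rng.randrange(height)
            pixels = stroke_pixels(cx, cy, half_w, half_h, width, height)
            # Assumed color rule: mean of the original under the stroke.
            color = sum(original[y][x] for x, y in pixels) / len(pixels)
            old_err = sum((original[y][x] - canvas[y][x]) ** 2 for x, y in pixels)
            new_err = sum((original[y][x] - color) ** 2 for x, y in pixels)
            # Step 12: accept only if the error over the stroke area decreases.
            if new_err < old_err:
                for x, y in pixels:
                    canvas[y][x] = color
                accepted.append((cx, cy, half_w, half_h, color))
    return canvas, accepted

def total_error(original, canvas):
    """Sum of squared differences over the whole image."""
    return sum((o - c) ** 2
               for row_o, row_c in zip(original, canvas)
               for o, c in zip(row_o, row_c))
```

Because a stroke is accepted only when it lowers the error inside its own area, the total error of the canvas can never increase in this sketch.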

After completing the rendering process, redundant strokes (those fully covered by
subsequent strokes) are eliminated from the code of the stroke series. This usually
results in a 20-50% decrease in the number of placed strokes. Presently, the number of
possible stroke orientations $\varphi$ is 8.
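
The elimination of redundant strokes can be illustrated as follows. This is a sketch under the assumption that each stroke is an axis-aligned, half-open rectangle `(x0, y0, x1, y1)` listed in painting order; the function name `remove_redundant` is hypothetical and the oriented strokes of the actual method are not modeled.

```python
def remove_redundant(strokes):
    """Drop strokes that are fully covered by strokes painted later.

    strokes: list of (x0, y0, x1, y1) half-open rectangles in painting order.
    Returns the strokes that keep at least one visible pixel, in the same order.
    """
    covered = set()   # pixels already claimed by later strokes
    kept = []
    # Walk backwards: the last stroke is always visible where it was painted.
    for x0, y0, x1, y1 in reversed(strokes):
        pixels = {(x, y) for y in range(y0, y1) for x in range(x0, x1)}
        if not pixels <= covered:      # at least one pixel is still visible
            kept.append((x0, y0, x1, y1))
        covered |= pixels
    kept.reverse()
    return kept
```

For example, a small stroke painted first and later overpainted by a larger one disappears from the stroke series, which shortens the code without changing the final image.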

Our algorithm is stochastic, error-controlled and multiscale. In our
present experiments brush-strokes are simple rectangles (see Figure 1). Since strokes
need relatively large convolutions, the $C_{\varphi,\delta}$ maps are generated in the Fourier
domain. If the process is not random enough (e.g. if prior structural constraints are
enforced when placing a stroke, such as considering and modeling ‘edgy’ places), the
result may suffer from structural side-effects, like strong and disturbing contours.

The method has been tested with several parameter sets and test images. Some of the
resulting images can be found in Figure 2. The “painted” images seemed to have high
quality at the last phase with the finest brush. Moreover, at every stage the edges are
sharp and the patterns and textures are appropriate. It is interesting to note that the
sharp edges and patterns can be generated without any a priori definition of contours
or textured places.




Fig. 1. Painting strokes with border (left) and without border (right) to demonstrate the non- 
structured random searching of the process (detail of test image “Barbara”). 



This method demonstrates that a fully random search process generating brush-strokes
can result in a high-quality and pleasant-to-see image. The method takes about
10-30 minutes on a Pentium III PC when generating a 512x512 image. However, it
can easily be implemented on parallel processor arrays at high speed, like the MRF-based
algorithms in [12]. Since the above method is a ‘brute-force’ algorithm with an
exhaustive stochastic search, any improvement in the convergence speed and in the
compactness of the generated code of the stroke series may help in applications.



4 Generating Series of Strokes Constituting a Markov Chain 
with Respect to the Metropolis-Hastings [9] Rule 

When generating strokes to cover a patch of the image, the process is random: there is
a target density that constrains the bounding errors, and the proposal density depends
on the characteristics of the painting process. First, we review some
Metropolis-Hastings (MH) algorithms that we may use in the sequel.

The paintbrush rendering as originally described in [11] is a brute-force algorithm
with respect to its goal function, error bound and convergence. By applying MH
methods, we hope to reach better-defined algorithms, with explicable effects and
distributions and well-proven convergence properties.




[Figure panels, left to right: original image; intermediate stage with coarser brush-size; finished image with small brush-size]

Fig. 2. Painting stages at different brush-sizes of the images “Lena”, “Leopard”, and “Goldhill”
(the number of brush-stroke scales was up to 10).



The goal of the painting process is to get a stroke-patched image which differs from 
the original one by less than a minimum mean-error, the difference being smaller than 
some minimum for any stroke-position. It can be best described by using an appropri- 
ate target density for the similarity-error over the individual strokes. 

The individual event $X^{(t)}$ generates a paintbrush stroke at a position $S^{(t)}$ (the
coordinates of its reference point and the tilting angle) and with a given color. The
distortion error between $X^{(t)}$ and the original image in the area of the stroke is
$E(x^{(t)})$. The position $S^{(t)}$ is randomly generated and may or may not depend on the
previous stage and the overall image parameters, while the color is defined by the mean
or majority color behind the proposed stroke in the original (reference) image. The
target distribution $f$ is the goal function of the distortion error, $f(x) = f(E(x))$.

With the above notation, we can generate the series of $X^{(t)}$ using the following
algorithms:


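PB-MH Algorithm:

Given the stroke $x^{(t)}$ of the t-th iteration,

1. Generate $Y_t \sim q(y \mid x^{(t)})$

2. Take

$$X^{(t+1)} = \begin{cases} Y_t & \text{with probability } \rho(x^{(t)}, Y_t) \\ x^{(t)} & \text{with probability } 1 - \rho(x^{(t)}, Y_t) \end{cases}$$

where

$$\rho(x, y) = \min\left\{ \frac{f(y)\, q(x \mid y)}{f(x)\, q(y \mid x)},\ 1 \right\} \qquad (1)$$

The algorithm starts with the error bound of the proposed stroke approximations as the
target density $f$. The conditional density $q(y \mid x)$ is a random position generator of
$S$, related to the overall controlling error/edge map and the previous stroke $X$.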

Independent PB-MH

Given $x^{(t)}$,

1. Generate $Y_t \sim g(y)$ (generated from the error map and/or edge map of the
input image)

2. Take

$$X^{(t+1)} = \begin{cases} Y_t & \text{with probability } \min\left\{ \dfrac{f(Y_t)\, g(x^{(t)})}{f(x^{(t)})\, g(Y_t)},\ 1 \right\} \\ x^{(t)} & \text{otherwise} \end{cases} \qquad (2)$$

Here the proposal density $g$ is a random position generator of $S$, independent
of the previous position. This might be uniform over the whole image.

Random Walk PB-MH

Given $x^{(t)}$,

1. Generate $Y_t \sim g(y - x^{(t)})$

2. Take

$$X^{(t+1)} = \begin{cases} Y_t & \text{with probability } \min\left\{ \dfrac{f(Y_t)}{f(x^{(t)})},\ 1 \right\} \\ x^{(t)} & \text{otherwise} \end{cases} \qquad (3)$$

Here the proposal density $g$ is a random position generator of $S$,
related to the previous $S$ position.
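
The three variants differ only in how the proposal is drawn, which a short sketch can make explicit. The code below is an illustration on a toy state space of integer positions 0..n-1, not the paintbrush implementation; `mh_step`, `independent_proposal` and `random_walk_proposal` are hypothetical names, and the wrap-around in the random walk is an assumption made to keep that proposal symmetric.

```python
import random

def mh_step(x, f, propose, q=None, rng=random):
    """One Metropolis-Hastings step from state x.

    f(s)       unnormalized target density
    propose    function (x, rng) -> candidate y
    q(a, b)    density of proposing a when the chain is at b; pass None when
               the proposal is symmetric or constant, so that it cancels
    """
    y = propose(x, rng)
    if q is None:
        num, den = f(y), f(x)
    else:
        num, den = f(y) * q(x, y), f(x) * q(y, x)
    # If the current state has zero density, any move is accepted (assumption).
    rho = 1.0 if den == 0 else min(num / den, 1.0)
    return y if rng.random() < rho else x

def independent_proposal(n):
    """Uniform proposal over n positions, independent of the current one."""
    return lambda x, rng: rng.randrange(n)

def random_walk_proposal(n, step=1):
    """Symmetric step of at most `step` positions, wrapping around at the ends."""
    return lambda x, rng: (x + rng.randint(-step, step)) % n
```

With a uniform independent proposal or a symmetric random walk the proposal terms cancel, so `q` can be left as `None` and the acceptance probability reduces to the ratio of target densities.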

The original exhaustive search algorithm in [11] operates with the ‘Random
walk’ or ‘Independent’ proposal positioning, while in the accept/reject method $f$ is a
simple threshold on $E$ ($f(x) = f(E(X))$), where $E(x^{(t)})$ and $E(Y_t)$ are considered
at the proposed position, and $X^{(t)}$ and $Y_t$ indicate the whole image at different
painted stages. When trying probabilistic accept/reject strategies in the same
framework, we could not get better convergence. As examples, we tested the rules

$$X^{(t+1)} = \begin{cases} Y_t & \text{with probability } \min\left\{ \dfrac{E(x^{(t)})}{E(Y_t)},\ 1 \right\} \\ x^{(t)} & \text{otherwise} \end{cases}$$

or

$$X^{(t+1)} = \begin{cases} Y_t & \text{with probability } \max\left\{ \min\left\{ E(x^{(t)}) - E(Y_t),\ 1 \right\},\ 0 \right\} \\ x^{(t)} & \text{otherwise} \end{cases}$$

but the resulting quality and/or the convergence speed were much poorer than with the
simple exhaustive search. We therefore stepped toward a fully stochastic, MCMC-based
algorithm.



5 Considering Statistical Tendencies of Densities of Distortion 

The above algorithm [11] generates nice painting-like images, but the process needs
optimization to gain speed. We can simply apply a quality constraint
$E(x^{(t)}) - E(Y_t) > \varepsilon > 0$ to force the acceptance of larger changes only, but
$\varepsilon$ may depend on the image and the current situation, causing the loss of most of
the proposals, as in Fig. 5 (left). In practice, choosing $\varepsilon = 0.01$ may improve the
compression ratio or convergence speed for several images. However, we should find a
more self-calibrating manner of acceptance.

[Figure 3 plot: “Distortion Error of PB Strokes vs. Original Image”; horizontal axis: % of distortion; curves: distortion error of proposed strokes (coarse PB and fine PB) and of accepted strokes (coarse PB and fine PB)]

Fig. 3. Probability densities of the distortion error of the proposed ($Y_t$) and accepted
($X^{(t+1)} = Y_t$) strokes versus the original image when generating the strokes for the
‘Barbara’ image. First coarse (20x5), finally fine (7x2) strokes are generated.



In the following, $X^{(t)}$ (previous stage) and $Y_t$ (proposed stage) denote state
images with events of new strokes at their respective positions $S$ on the state image
(painted by previous strokes as well).

We define distortion errors against the original image $I$:

• $E(x^{(t)}) = D(x^{(t)}, I)$ is the distortion error between the original and the
previously painted stage image,

• $E(Y_t) = D(Y_t, I)$ is the distortion error between the original and the proposed
painted stage $Y_t$,

• $E(x^{(t-1)}) = D(x^{(t-1)}, I)$ is the distortion error between the original and the
last-but-one painted stage.

$D(\cdot, \cdot)$ is defined as the mean square error, normalized by the actual area of the
stroke into the [0.0, 1.0] interval.
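
As an illustration of this definition, the following sketch computes a mean square error over the stroke area and scales it into [0, 1]. The scaling by the maximum gray value (`max_value`, 255 by default) is an assumption, since the text does not state how the normalization is done, and the function name `distortion` is hypothetical.

```python
def distortion(stage, original, stroke_pixels, max_value=255.0):
    """Mean square error between a painted stage and the original image,
    averaged over the pixels of one stroke and scaled into [0.0, 1.0].

    stage, original: images as lists of rows of gray values in [0, max_value]
    stroke_pixels:   list of (x, y) pixel coordinates covered by the stroke
    """
    if not stroke_pixels:
        raise ValueError("a stroke must cover at least one pixel")
    total = sum(((stage[y][x] - original[y][x]) / max_value) ** 2
                for x, y in stroke_pixels)
    return total / len(stroke_pixels)
```

Dividing by the number of stroke pixels makes the errors of coarse and fine strokes comparable, which is what allows differences of such errors to be compared across stroke positions.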

Figure 3 shows the statistics of the proposed and accepted $D(Y_t, I)$ values. First
coarser, later finer strokes are generated; in the meantime the average error
rate decreases. We can see that the distribution $f(D(Y_t, I))$ converges to a Dirac
delta at zero for the accepted ($X^{(t+1)} = Y_t$) proposals. If we use any of the density
functions of Figure 3 for a Metropolis-Hastings approach of eq. (1), the process does
not work:

• If we use the long-term statistics $f(Y_t) = f(D(Y_t, I))$, we get the same
or much slower convergence than with the exhaustive search, since it does not fit
the acceptance rule;

• If we use short-term statistics or the final $f(D(Y_t, I))_{\mathrm{accepted}}$ (a
Dirac delta), the process stops at the beginning or stops at a low quality.

However the above statistics are applied, the problem remains that Metropolis-Hastings
MCMC converges relatively slowly, while the proposed and accepted densities
$f(D(Y_t, I))$ change considerably. So the convergence and the basic idea of
the process fail during the iterations. The other problem is that we must accept only
proposals that decrease the overall error; otherwise the image becomes unstructured,
as in Fig. 5 (right).

In our new development of the process, the following constraints should be kept:

• Accept error reduction only;

• The previous event and the present proposal, at different proposed stroke positions,
are compared, instead of distortion errors at the same place of the proposed stroke;

• The MCMC density approximation is dynamic, following the changing target density.

From several experiments and the above constraints it is clear that not the
$D(\cdot, \cdot)$ terms themselves but their differences are important in the optimization
process. We now define two error differences:

$$\mathrm{Diff}(Y_t) = E(x^{(t)}) - E(Y_t)$$

$$\mathrm{Diff}(x^{(t)}) = E(x^{(t-1)}) - E(x^{(t)})$$

Figure 4 shows the statistics of the proposed (different cases are nearly the same)
and the accepted (two cases) $\mathrm{Diff}(Y_t)$ events, as a function of the relative
error rate.

It is clearly demonstrated that the density $f(\mathrm{Diff}(Y_t)_{\mathrm{accepted}})$ changes
considerably through the iterations, while the probability of the higher values of the
proposed $\mathrm{Diff}(Y_t)$ should decrease in time to get better convergence. This means
that the stochastic rendering should consider the higher distortion values with a
higher probability. Since the density $f(\mathrm{Diff}(Y_t)_{\mathrm{proposed}})$ is nearly a
Dirac delta, containing mostly low $\mathrm{Diff}(Y_t)$ values, it can be supposed that the
proposed dynamic target density satisfies:

• $f(Y_t) = 0$ for $\mathrm{Diff}(Y_t) < 0$, so such a proposal is rejected;

• $f(Y_t)$ is progressive for higher $\mathrm{Diff}(Y_t)$ values;

• $f(Y_t)$ should be normalized by the variance of the current distribution of
$\mathrm{Diff}(Y_t)$ values to follow the narrowing of the distribution.



[Figure 4 plot: “Difference of Distortion Errors at the Proposed/Accepted Positions”; curves: difference of distortion errors of accepted strokes (coarse PB) vs. previous values, of accepted strokes (fine PB) vs. previous values, and of proposed strokes vs. previous values]

Fig. 4. Probability densities of the difference between the distortion errors of the proposed
and accepted strokes and the distortion on the previous area of the stroke
($\mathrm{Diff}(Y_t)$), when generating the strokes for the ‘Barbara’ image. First coarse
(20x5), finally fine (7x2) strokes are generated.




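Fig. 5. Erroneous results of rendering when (left) only strokes with a high quality increase
are accepted, $\mathrm{Diff}(Y_t) > 0.1$; (right) the acceptance of a stroke may increase the
overall error rate. The latter is nearly the same as a process of fully random positioning,
proposal and acceptance.

Based on the above considerations, we propose the following accept/reject rule:

$$X^{(t+1)} = \begin{cases} Y_t & \text{with probability } \rho(x^{(t)}, Y_t) \\ x^{(t)} & \text{with probability } 1 - \rho(x^{(t)}, Y_t) \end{cases}$$

where

$$\rho(x, y) = \min\left\{ \frac{f(y)}{f(x)} \cdot \frac{q(x \mid y)}{q(y \mid x)},\ 1 \right\} \qquad (4)$$

$$f(Y_t) = \begin{cases} 0 & \text{if } \mathrm{Diff}(Y_t) < 0 \\ \left( \dfrac{\mathrm{Diff}(Y_t)}{\sqrt{\mathrm{Var}_t}} \right)^{q} & \text{else} \end{cases}$$

Here $\mathrm{Var}_t$ is the variance of $\mathrm{Diff}(Y_t)$ in a neighborhood time interval around $t$.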



Since $\mathrm{Var}_t$ changes slowly in time, it can be eliminated (and neglected)
from $\rho(x, y)$ in eq. 4. Although $f(\cdot)$ is the target density, it cannot be measured as
a result of long-term statistics, since $\mathrm{Var}_t$ causes a dynamic windowing effect on the
output histogram, as Figure 6 demonstrates.
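
A possible reading of the simplified rule is sketched below. It assumes a symmetric proposal density and a target density that grows as the power q of the error difference, so that once the slowly changing variance is dropped the acceptance probability depends only on the ratio of the current and previous error differences. The function name `acceptance_probability`, the default exponent and the handling of a non-positive previous difference are assumptions of this sketch, not statements of the published algorithm.

```python
def acceptance_probability(diff_y, diff_x, q_exp=2.5):
    """Acceptance probability of a proposed stroke under the simplified rule.

    diff_y: error reduction of the proposed stroke, E(x_t) - E(Y_t)
    diff_x: error reduction achieved by the previously accepted stroke
    q_exp:  exponent q of the assumed power-law target density
    """
    if diff_y <= 0:
        return 0.0        # accept error reduction only
    if diff_x <= 0:
        return 1.0        # assumption: no previous gain, so any gain is accepted
    return min((diff_y / diff_x) ** q_exp, 1.0)
```

Under this reading a proposal that improves at least as much as the previous accepted stroke is always taken, while smaller improvements are taken with a probability that falls off faster for larger q.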

Applying eq. 4, we tested different exponents q to reach the same image
quality in PSNR under the same constraints (starting and stopping conditions, brush-sizes,
etc.). The results for the image ‘Barbara’ (using 3 different sizes of strokes) can be found
in Table 1. We have tested many other images and parameter settings, with very similar
conclusions. Tables 2 and 3 show results for two 512x512 color images.
Table 4 shows a test of ‘Lena’ using 10 different sizes of strokes. Exhaustive
search with $\varepsilon = 0.01$ may also give a good compression result, but at the cost of twice
the effort, while MCMC at q=2.5 shows a good compromise.

We can see that eq. 4 gives an effective algorithm for image rendering at q=2...3.
The convergence speed is much better because of the smaller number of proposing and
drawing events, with a considerable gain in the resulting number of paintbrush strokes.

We have also tested the role of the proposal density $q(y \mid x)$. Usually it is
symmetric, or very close to symmetric. Step 10 in the algorithm of Section 3
may be asymmetric, but its outcome is not too important. So, usually, we can
consider $q(y \mid x) = q(x \mid y)$.

The proposal method can also be important in some cases. With a ‘Random
walk’ over a small neighborhood instead of the ‘Independent’ method with steps of any
size, textured images can converge better.

In recent research we are developing the above method further to include an
MRF-conforming neighborhood calculus in the energy term, in order to run a structured
MRF-like painting algorithm in an MCMC estimation series.

[Figure 6 plot: “Paintbrush Ratio Accepted/Proposed (Barbara)”]

Fig. 6. Narrowing effect of the distribution of $\mathrm{Diff}(Y_t)$ and the measured statistics of the
conditional probability. (left) Plot of the ratio of the measured histograms of the accepted and
proposed distributions, $p(\mathrm{Diff}(Y_t)_{\mathrm{accepted}}) / p(\mathrm{Diff}(Y_t)_{\mathrm{proposed}})$, for the same
size of stroke series, showing the change in the conditional variance; (right) the above ratio of
histograms for two different brush-sizes in whole runs. For less optimized cases the monotonic
function becomes a simple thresholding.



Table 1. Painting results for ‘Barbara 512x512’ at a standard $D(Y_t, I)$ = 25.5 dB PSNR final
distortion error rate. Counts are in thousands of events.

Method                                Non-redundant   All drawn strokes       All proposed strokes
                                      PB strokes      (Diff(Y_t) accepted)    (Diff(Y_t) proposed)
Diff(Y_t) > 0 (exhaustive search)     35              61                      2645
q=1                                   29              40                      2380
q=2                                   26              31                      1555
q=3                                   28              34                      1200



6 Conclusions 

We have developed a dynamic approximation of the target density for image rendering. A
Metropolis-Hastings rule is proposed in which the target density dynamically fits the
process while the characterizing densities change through the iterations. This
method results in better convergence (a lower number of proposed and accepted strokes)
and better compression (a lower number of non-redundant strokes). The above MCMC-based
stochastic search algorithm is quite different from other searching and matching
methods, like matching pursuit [8] or effect-oriented painting [7].



Table 2. Painting results for a 512x512 color image (detail of
http://www.sztaki.hu/~sziranyi/Sz-Barbara.jpg) with room, flowers and a girl at a standard
$D(Y_t, I)$ = 26.5 dB PSNR final distortion error rate. Counts are in thousands of events.

Method                                Non-redundant   All drawn strokes       All proposed strokes
                                      PB strokes      (Diff(Y_t) accepted)    (Diff(Y_t) proposed)
Diff(Y_t) > 0 (exhaustive search)     32              52                      6490
q=1                                   26              34                      2975
q=2                                   24              27                      1630
q=3                                   19              22                      3445


Table 3. Painting results for a 512x512 color image with a boy and color books, at a standard
$D(Y_t, I)$ = 26.6 dB PSNR final distortion error rate. Counts are in thousands of events.

Method                                Non-redundant   All drawn strokes       All proposed strokes
                                      PB strokes      (Diff(Y_t) accepted)    (Diff(Y_t) proposed)
Diff(Y_t) > 0 (exhaustive search)     9               16                      760
q=1                                   7.4             12                      860
q=2                                   7               9.6                     560
q=3                                   7.4             8.7                     495


Table 4. Painting results for ‘Lena 512x512’ at a standard $D(Y_t, I)$ = 30.45 dB PSNR final
distortion error rate. Counts are in thousands of events.

Method                                  Non-redundant   All drawn strokes       All proposed strokes
                                        PB strokes      (Diff(Y_t) accepted)    (Diff(Y_t) proposed)
Diff(Y_t) > 0 (exhaustive search)       28              41                      1369
Diff(Y_t) > 0.01 (exhaustive search)    11              11                      2463
q=1.5                                   17              20                      2441
q=2.5                                   14              14                      1504


Abstract. In this paper, we derive a complete set of Zernike moment
correlation functions used to capture the spatial structure of a color texture. The set
of moment correlation functions is grouped into moment correlation matrices to
be used in illumination-invariant recognition of color texture. For any change in
the illumination, the moment correlation matrices are related by a linear
transformation. Circular and non-circular correlations are discussed, and
comparisons with previously suggested color covariance functions have been
carried out using about 600 textured images with different illuminations and
rotations. Using moment correlation matrices in the invariant recognition of color
texture promises high computational efficiency as well as
recognition accuracy. The derived correlation invariants are proposed as a
general formalism that can be used directly with other kinds of complex
moments, e.g. Fourier-Mellin, pseudo-Zernike, disc-harmonic coefficients, and
wavelet moments, to obtain moment-correlation-based invariants.



1. Introduction 

Early image recognition algorithms were based on computing (geometric) invariant
features for gray-level intensity images. The goal was to detect an object or classify a
textured image from an image database (gray-level and binary image recognition is
still dominant in computer vision and pattern recognition applications). Despite the
increase in dimensionality, the use of color is unavoidable in recent recognition
applications. In fact, using color images may give better recognition performance
than gray-level images, due to the capability of capturing local and global image
features within and between color bands. Moreover, it is not possible to perform
illumination-invariant recognition without using the color properties of an image.

Many techniques have been suggested that use the multiple bands of a
color image to achieve geometry-, illumination-, or illumination-geometry-invariant
recognition. First was the work of Swain and Ballard [1], in which they showed that color
distributions can be used directly for recognition without even paying attention to the
spatial structure of the image. Their method, however, fails if the illumination spectrum
changes or if the image is highly structured (it is possible for regions with
significantly different spatial structure to have similar color distributions). A Color
Indexing color constancy algorithm [2] was developed to remove the dependency of
color distributions on illumination changes. The algorithm performs well for an object



M.A.T. Figueiredo, J. Zerubia, A.K. Jain (Eds.): EMMCVPR 2001, LNCS 2134, pp. 216-231, 2001.
(c) Springer-Verlag Berlin Heidelberg 2001
recognition task but with less success when the image is highly structured, as in
textures. The other group of color image recognition algorithms deals with computing
features based on spatial structure; some of these methods are Gabor filters [3], color
distributions of spatially filtered images [4], Markov random field models [5], [6],
and spatial covariance functions [7]. Moment invariants of color covariance functions
[7] (or, as the authors called them, color correlation functions; see Rosenfield [8] for
the exact terminology, which agrees with the one we use in this paper) within and
between bands of a color image have been used to recognize three-dimensional
textures. The same color covariance functions have been used successfully in a series
of illumination recognition experiments on 2-D color texture [9]-[11]. Jain [10] used
color covariance functions to recognize multispectral satellite images. In [11], Zernike
moment invariants were computed for color covariance functions; the derived Zernike
moment invariants, however, were not complete. In this paper a complete set of
Zernike moment correlation and covariance matrices is derived. Different color
correlations are introduced, circular and non-circular. Experiments using
about 600 different illumination-rotation images compare the proposed
model to the previously suggested color covariance functions.



2. Spatial Interaction within and between Color Bands 

To recognize the texture of a color image, the interaction within and
between its bands is considered in this paper. The family of spatial covariance functions
forms one of the most reliable schemes used to model color texture. In this paper
we will discuss four different measures of these covariance functions.



2.1 Spatial Covariance Functions 

Over the image region defining the texture, each band $I_i(\alpha, \beta)$ is assumed wide-sense
stationary and each pair of bands is assumed jointly wide-sense stationary. The
set of covariance functions within and between sensor bands ($1 \le i, j \le N$) is defined
as

$$C_{ij}(x, y) = E\{[I_i(\alpha, \beta) - \bar{I}_i][I_j(\alpha + x, \beta + y) - \bar{I}_j]\} \qquad (1)$$

where $\bar{I}_i$ and $\bar{I}_j$ denote spatial means and $E$ denotes the expected value. For the
trichromatic case $N = 3$ we observe the following properties:

• The definition given in (1) leads to nine covariance functions: three
autocovariance functions and six crosscovariance functions. All nine spatial
covariance functions have the property $C_{ij}(x, y) = C_{ji}(-x, -y)$, so
only the autocovariance functions are symmetric about the origin. Therefore, we
can make use of this symmetry to reduce computations.

• The crosscovariance functions are not symmetric; however, only three need to be
computed, since the other three (those with the band indices interchanged) can be obtained
using $C_{ij}(x, y) = C_{ji}(-x, -y)$. It was shown in [7] that it is useful to use only
the basic six covariance functions.

• Consider two surfaces $S$ and $S'$ oriented arbitrarily in space, where $C_{ij}(\mathbf{x})$
and $C'_{ij}(\mathbf{x})$ are the corresponding covariance functions and $\mathbf{x} = [x \; y]^{T}$. From
[7], those covariance functions are related by a linear coordinate transform $M$
as $C'_{ij}(\mathbf{x}) = C_{ij}(M\mathbf{x})$.

• Values of spatial covariance functions may be negative, zero, or positive. 

• If the illumination between the corresponding textures changes, then the relation
between their corresponding covariance functions changes. Suppose a textured
surface is observed at two different orientations in space under different
illumination conditions, and suppose that each covariance function is arranged
into a column vector $C_k(\mathbf{x})$; we then group the covariance functions into a
covariance matrix $C(\mathbf{x}) = [C_1(\mathbf{x}) \; C_2(\mathbf{x}) \; \ldots \; C_6(\mathbf{x})]$. Following [11], let $C(\mathbf{x})$
be the covariance matrix of the surface corresponding to the illumination $l(\lambda)$
and $C'(\mathbf{x})$ be the covariance matrix for the same surface after an orientation
change described by $M$ and under illumination $\tilde{l}(\lambda)$; then $C(M\mathbf{x}) = C'(\mathbf{x})E$,
where $E$ is a $6 \times 6$ matrix with elements that depend on $l(\lambda)$ and $\tilde{l}(\lambda)$.
Therefore, for a change in illumination and orientation the covariance matrices
are related by a linear transformation $E$ and a linear coordinate transformation
$M$.

The above covariance functions have been used successfully in the recognition of
color texture. All previous works, however, treated the crosscovariance
functions as symmetric (in fact, symmetry was enforced). One reason is
the high degree of pixel-to-pixel correlation between different bands belonging to the
same image, which leads to a very small symmetry error.



2.2 Spatial Correlation Functions 

Here we assume again that over the image region defining the texture, each band
$I_i(\alpha, \beta)$ is wide-sense stationary and each pair of bands is jointly wide-sense
stationary. We define a set of correlation functions within and between sensor bands
($1 \le i, j \le N$) as

$$R_{ij}(x, y) = E[I_i(\alpha, \beta) I_j(\alpha + x, \beta + y)], \qquad (2)$$

where $E$ denotes the expected value. For the trichromatic case $N = 3$, correlation
functions have the same properties as those given for covariance functions, except
that correlation functions always have positive values. Positive values of
correlation functions are necessary when used with moments, since moments should
be computed for nonnegative bounded functions. In previous work, Kondepudy
and Healey [7] used the absolute value of color covariance functions to
eliminate the negative values. This may in turn distort the color covariance functions,
and the transform between the original image and its corresponding test image may then be
nonlinear or unpredictable.
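
The definitions (1) and (2) can be estimated from a finite image patch by replacing the expectation with a sample average. The following is a minimal sketch of one such estimator, not the authors' implementation: the function name `spatial_correlation_covariance`, the use of global band means, and averaging only over pixel pairs that fall inside the patch are assumptions made for illustration.

```python
def spatial_correlation_covariance(band_i, band_j, max_lag):
    """Estimate R_ij(x, y) and C_ij(x, y) for |x|, |y| <= max_lag.

    band_i, band_j: equally sized 2-D lists (rows indexed by beta, columns by alpha).
    Returns two dicts keyed by the lag (x, y).
    """
    rows, cols = len(band_i), len(band_i[0])
    n = rows * cols
    mean_i = sum(map(sum, band_i)) / n   # spatial mean of band i
    mean_j = sum(map(sum, band_j)) / n   # spatial mean of band j
    corr, cov = {}, {}
    for y in range(-max_lag, max_lag + 1):
        for x in range(-max_lag, max_lag + 1):
            r_sum = c_sum = 0.0
            count = 0
            # Visit only positions whose shifted partner lies inside the patch.
            for b in range(max(0, -y), min(rows, rows - y)):
                for a in range(max(0, -x), min(cols, cols - x)):
                    u = band_i[b][a]
                    v = band_j[b + y][a + x]
                    r_sum += u * v
                    c_sum += (u - mean_i) * (v - mean_j)
                    count += 1
            corr[(x, y)] = r_sum / count
            cov[(x, y)] = c_sum / count
    return corr, cov
```

With this estimator the relation $C_{ij}(x, y) = C_{ji}(-x, -y)$ holds by construction, so only six of the nine functions need to be computed, and the correlation estimates are nonnegative whenever the pixel values are.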



2.3 Circular Correlation and Circular Covariance Functions 



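It is important to define a third group of correlation functions that capture some
circularly symmetric properties when the texture region is averaged within and between
sensor bands. One way to do this is by averaging or estimating color correlations (or
covariances) inside a circular region. We define circular correlations within and
between sensor bands as:

$$R^{c}_{ij}(x, y) = \begin{cases} E[I_i(\alpha, \beta) I_j(\alpha + x, \beta + y)] & \alpha^{2} + \beta^{2} \le \Re^{2} \\ 0 & \text{elsewhere} \end{cases} \qquad (3)$$

where $\Re$ is the radius of the region over which the expectation is computed. Similarly,
circular covariance functions are defined by:

$$C^{c}_{ij}(x, y) = \begin{cases} E\{[I_i(\alpha, \beta) - \bar{I}_i][I_j(\alpha + x, \beta + y) - \bar{I}_j]\} & \alpha^{2} + \beta^{2} \le \Re^{2} \\ 0 & \text{elsewhere} \end{cases} \qquad (4)$$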



Circular color correlations and color covariances should give better results because
they capture the same amount of information as an image is rotated by an
angle. Experimental results discussed later will show which of the four proposed
schemes outperforms the others. Figure 1 shows the cloth image
photographed under five different illuminations. To illustrate the effect of the different
color correlation and color covariance functions, the entire families of covariance and
correlation functions computed for the cloth image under white
illumination are shown in Figs. 2-5.




Fig. 1. The image of cloth under five different illuminations. From left to right, the
illuminations are white, red, green, blue, and yellow.

It is our task to show that the images in Fig. 1, and another 25 cloth images (five for
each illumination) at different rotation angles, belong to the same original class. In
Figs. 2-5, we see that the spatial correlation and spatial covariance functions draw a
surface that is to be recognized (instead of the original multispectral image). That
surface may take a pyramid-like or cone-like shape and may be
deformed according to the combination of geometry and illumination changes. To carry out
this recognition and to measure how much those shapes change, we shall use the
method of moment invariants (specifically Zernike moments).
Fig. 2. (a) Correlation functions of cloth image under white illumination.

Fig. 3. (a) Covariance functions of cloth image under white illumination.

Fig. 3. (b) Covariance functions of cloth image under white illumination.

Fig. 5. Circular covariance functions of cloth image under white illumination.



3. Zernike Moments of Correlation and Covariance Functions

Zernike moments [12] may be used to produce one of the most reliable feature sets
for invariant pattern recognition; see [13]-[15]. It is our purpose to
compute Zernike moments of covariance functions and correlation functions to obtain
invariant features that will be used to recognize the color texture. The complex
Zernike moments of $f(x, y)$ are defined as:

$$Z_{nl} = \frac{n + 1}{\pi} \iint_{x^{2} + y^{2} \le 1} f(x, y) [V_{nl}(r, \theta)]^{*} \, dx \, dy = (Z_{n,-l})^{*}, \qquad (5)$$

where the integration is taken inside the unit disk $x^{2} + y^{2} \le 1$, $n = 0, 1, 2, \ldots, \infty$ is the
order, and $l$ is the repetition, which takes on positive and negative integer values
subject to the conditions that $n - |l|$ is even and $|l| \le n$. Note that $V_{nl}(r, \theta)$ are the complex
Zernike polynomials given as $V_{nl}(r, \theta) = R_{nl}(r) \exp(il\theta)$; see [12] for their complete
definitions. Zernike moments where $l$ takes negative values can be obtained by
making use of the complex conjugate property, which is $Z_{nl} = Z^{*}_{n,-l}$. In this work,
Zernike moments will be used to generate invariants of color correlation and color
covariance functions. These invariants are of great importance since they reduce the
redundancy of the correlation and covariance features. For one specific autocorrelation or
crosscorrelation function that corresponds to one $(i, j)$ pair, the Zernike moments
are computed as

$$Z^{ij}_{nl} = \frac{n + 1}{\pi} \iint_{x^{2} + y^{2} \le 1} R_{ij}(x, y) [V_{nl}(r, \theta)]^{*} \, dx \, dy, \qquad (6)$$

and Zernike moments of color covariances are obtained by using $C_{ij}(x, y)$ instead of
$R_{ij}(x, y)$ in the above equation.
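
The moment definition can be approximated on a pixel grid by mapping a square array onto $[-1, 1]^{2}$ and summing over the pixel centers that fall inside the unit disk. The sketch below illustrates this under stated assumptions: the function names `radial_poly` and `zernike_moment`, the pixel-center sampling, and the standard closed form of the radial polynomial are choices made here for illustration and are not taken from the paper.

```python
import cmath
import math

def radial_poly(n, l, r):
    """Zernike radial polynomial R_nl(r); requires n - |l| even and |l| <= n."""
    l = abs(l)
    total = 0.0
    for s in range((n - l) // 2 + 1):
        num = (-1) ** s * math.factorial(n - s)
        den = (math.factorial(s)
               * math.factorial((n + l) // 2 - s)
               * math.factorial((n - l) // 2 - s))
        total += num / den * r ** (n - 2 * s)
    return total

def zernike_moment(f, n, l):
    """Approximate Z_nl of a square 2-D list f mapped onto the unit disk."""
    size = len(f)
    step = 2.0 / size                      # pixel width after mapping to [-1, 1]
    z = 0j
    for row in range(size):
        y = -1.0 + (row + 0.5) * step      # pixel-center coordinates
        for col in range(size):
            x = -1.0 + (col + 0.5) * step
            r = math.hypot(x, y)
            if r > 1.0:
                continue                   # outside the unit disk
            theta = math.atan2(y, x)
            # Multiply by the conjugate of V_nl = R_nl(r) exp(i l theta).
            z += f[row][col] * radial_poly(n, l, r) * cmath.exp(-1j * l * theta)
    return (n + 1) / math.pi * z * step * step
```

Passing a sampled $R_{ij}(x, y)$ or $C_{ij}(x, y)$ as `f` gives an approximation of (6). A rotation of the input changes only the phase of each moment, which is why the magnitudes, or products with matching repetition, are used as rotation invariants.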



3.1 Construction of Zernike Moment Correlation Invariants 

Let $R(\mathbf{x})$ be the correlation matrix of the surface corresponding to the illumination
$l(\lambda)$ and let $R'(\mathbf{x})$ be the correlation matrix for the same surface after an orientation
change described by $M$ and under illumination $\tilde{l}(\lambda)$; then

$$R(M\mathbf{x}) = R'(\mathbf{x})E, \qquad (7)$$

where $E$ is the previously mentioned $6 \times 6$ matrix that depends on the illuminations
$l(\lambda)$ and $\tilde{l}(\lambda)$. Using the simple representation

$$R_k(M\mathbf{x}) = \sum_{i=1}^{6} e_{ik} R'_i(\mathbf{x}) \qquad (8)$$

for $k = 1, 2, \ldots, 6$, Zernike moments of the spatial correlation functions up to
order $n$ and repetition $l$ may be computed for both sides of (8), resulting in:

$$Z^{k}_{nl} = \sum_{i=1}^{6} e_{ik} Z'^{\,i}_{nl}, \qquad (9)$$

where $Z^{k}_{nl}$ is the moment of $R_k(M\mathbf{x})$ for $k = 1, 2, \ldots, 6$ and $Z'^{\,i}_{nl}$ is the moment of
$R'_i(\mathbf{x})$. The invariants $\varphi$ and $\varphi'$ are constructed as functions of Zernike moments;
these invariants may be represented as:

$$\varphi_{uk} = \varphi(Z^{k}_{nl}), \qquad (10)$$

$$\varphi'_{uk} = \varphi\left( \sum_{i=1}^{6} e_{ik} Z'^{\,i}_{nl} \right), \qquad (11)$$

where $u = 1, 2, \ldots$ is the index (or the order) of the invariant and is (usually) related to
the order $n$ of the Zernike moments. It is desired that

$$\varphi_{uk} = \varphi'_{uk}, \qquad (12)$$

which will enable us to recognize the desired features independently of illumination and
geometry changes. To construct a complete set of Zernike moment invariants (see
[12] and [16]), different moment orders may be used to cancel the phase and obtain
rotation-invariant features. One way to do this is by taking the magnitude of the Zernike
moments; the second is to use a phase cancellation technique. Consider the following
combined moment form:



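$$\psi_{uk} = Z^{k}_{n_1 l_1} (Z^{k}_{n_2 l_2})^{*}, \qquad (13)$$

$$\psi'_{uk} = \left( \sum_{i=1}^{6} e_{ik} Z'^{\,i}_{n_1 l_1} \right) \left( \sum_{i=1}^{6} e_{ik} Z'^{\,i}_{n_2 l_2} \right)^{*}, \qquad (14)$$

where $(\cdot)^{*}$ is the complex conjugate. For the rules of generating the numerical
values of the indices that will cancel the phase between the combined moments,
see [12] and/or [16]. Without loss of generality, let us assume that $n_1 \ge n_2$. The
invariants can thus be written precisely as

$$\varphi_{uk} = \mathrm{Re}\{\psi_{uk}\} \qquad (15)$$

and $\varphi'_{uk} = \mathrm{Re}\{\psi'_{uk}\}$, which is obtained by expanding (14); this yields:

$$\varphi'_{uk} = \sum_{i=1}^{6} e_{ik}^{2} \, \mathrm{Re}\{ Z'^{\,i}_{n_1 l_1} (Z'^{\,i}_{n_2 l_2})^{*} \} + \sum_{i=1}^{5} \sum_{j=i+1}^{6} e_{ik} e_{jk} \, \mathrm{Re}\{ Z'^{\,i}_{n_1 l_1} (Z'^{\,j}_{n_2 l_2})^{*} + Z'^{\,j}_{n_1 l_1} (Z'^{\,i}_{n_2 l_2})^{*} \}. \qquad (16)$$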



Wang and Healey [11] computed Zernike moment invariant matrices for the
special case $n_1 = n_2$, with some other minor differences. Many in-between invariants
were neglected even though they belong to the same order or less; therefore, the system of
invariants that they used was not complete. A complete system of invariants
can be generated from (16). It is also possible to derive the pseudo Zernike invariants as
$\tilde{\varphi}_{uk} = \mathrm{Im}\{\psi_{uk}\}$ and $\tilde{\varphi}'_{uk} = \mathrm{Im}\{\psi'_{uk}\}$, the latter given as

$$\tilde{\varphi}'_{uk} = \sum_{i=1}^{6} e_{ik}^{2} \, \mathrm{Im}\{ Z'^{\,i}_{n_1 l_1} (Z'^{\,i}_{n_2 l_2})^{*} \} + \sum_{i=1}^{5} \sum_{j=i+1}^{6} e_{ik} e_{jk} \, \mathrm{Im}\{ Z'^{\,i}_{n_1 l_1} (Z'^{\,j}_{n_2 l_2})^{*} + Z'^{\,j}_{n_1 l_1} (Z'^{\,i}_{n_2 l_2})^{*} \}. \qquad (17)$$



The invariants given in (16) have a general form that can be
used with other types of moment kernel functions, for instance
pseudo Zernike moments [13], orthogonal Fourier-Mellin moments [17], disc-harmonic
coefficient moments [18], and wavelet moments [20]. Here we are
interested in the invariants given in (16), and it is important to obtain a form that separates the
$e_{ik}$ elements so as to obtain illumination invariance. For this purpose, we have
rearranged (16) to be given as:

$$\varphi'_{uk} = \sum_{m=1}^{21} g_{um} h_{mk}, \qquad (18)$$

where

$$g_{um} = \begin{cases} \mathrm{Re}\{ Z'^{\,m}_{n_1 l_1} (Z'^{\,m}_{n_2 l_2})^{*} \} & 1 \le m \le 6 \\ \mathrm{Re}\{ Z'^{\,i_m}_{n_1 l_1} (Z'^{\,j_m}_{n_2 l_2})^{*} + Z'^{\,j_m}_{n_1 l_1} (Z'^{\,i_m}_{n_2 l_2})^{*} \} & 7 \le m \le 21 \end{cases} \qquad (19)$$

and

$$h_{mk} = \begin{cases} e_{mk}^{2} & 1 \le m \le 6 \\ e_{i_m k} \, e_{j_m k} & 7 \le m \le 21 \end{cases} \qquad (20)$$

where the $i_m$ and $j_m$ number set is given by $(i_m, j_m) = \{(1,2), (1,3), (1,4), (1,5), (1,6),
(2,3), (2,4), (2,5), (2,6), (3,4), (3,5), (3,6), (4,5), (4,6), (5,6)\}$ for $m = 7, 8, \ldots, 21$
respectively, i.e. $(i_7, j_7) = (1, 2), \ldots, (i_{21}, j_{21}) = (5, 6)$.



It is obvious that (18) takes the form of a matrix multiplication; representing (18)
in matrix form gives $\varphi' = GH$, and from (12) we know that $\varphi = \varphi'$; therefore $\varphi = GH$.
Let us assume a total of $w$ invariants, for which $u = 1, 2, \ldots, w$, and in our
case $k = 1, 2, \ldots, 6$; the matrices are

$$\varphi = [\varphi_{uk}]_{w \times 6}, \quad G = [g_{um}]_{w \times 21}, \quad H = [h_{mk}]_{21 \times 6}. \qquad (21)$$

Obviously, $\varphi$ is a $w \times 6$ matrix, which is the moment invariant matrix (it is
translation-rotation invariant), $G$ is a $w \times 21$ matrix, and $H$ is a $21 \times 6$
matrix with elements that depend only on $l(\lambda)$ and $\tilde{l}(\lambda)$, which represent the effect of
illumination. For texture recognition, $G$ is represented using an orthonormal basis
obtained by singular value decomposition [19] as follows:

$$G = U \Sigma V^{T}, \qquad (22)$$

where $U = [u_1, u_2, \ldots, u_{21}]$ is a $w \times 21$ matrix having columns that are
orthonormal eigenvectors of $GG^{T}$, $\Sigma$ is a $21 \times 21$ diagonal matrix of singular values
$\lambda_1, \lambda_2, \ldots, \lambda_{21}$, and $V$ is a $21 \times 21$ matrix having columns that are orthonormal
eigenvectors of $G^{T}G$. For recognition, the following distance function can
be used:

$$D = \sum_{j=1}^{6} \left\| \varphi_j - \sum_{i=1}^{21} (\varphi_j^{T} u_i) \, u_i \right\|^{2}, \qquad (23)$$

where $\varphi_1, \varphi_2, \ldots, \varphi_6$ are the column vectors of $\varphi$. The above distance function
characterizes how well the column vectors of $\varphi$ can be approximated as a linear
combination of the columns of $U$. Thus, the smallest value of $D$ for matrices $\varphi$ and
$G$ will correspond to textures related by some combination of rotation and
illumination changes. In our work, the matrix $\varphi$ is used to store the features of the
original database under white illumination. The matrix $G$ is used for the texture under
recognition (investigation), i.e., the texture that has undergone illumination and
geometry changes. To clarify the generation of the matrix $G$ we give a brief
description: first, generate the six color correlation functions using (2); then compute
Zernike moments for each of the correlation functions using (6); finally, generate the
elements of the $G$ matrix using the definition given in (19), following the rules
for generating a complete set of invariants given in [12] and [16].
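
The distance described above measures how far the columns of $\varphi$ lie from the column space of $G$. The sketch below illustrates that idea with one substitution that should be noted plainly: instead of a singular value decomposition it builds the orthonormal basis by modified Gram-Schmidt, which spans the same column space when $G$ has full column rank. The function names `column_space_basis` and `subspace_distance`, the tolerance, and the squared norm are assumptions made for illustration.

```python
def column_space_basis(G, tol=1e-10):
    """Orthonormal basis of the column space of G (a list of rows), by modified Gram-Schmidt."""
    rows, cols = len(G), len(G[0])
    basis = []
    for c in range(cols):
        v = [G[r][c] for r in range(rows)]
        for u in basis:
            dot = sum(a * b for a, b in zip(u, v))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = sum(a * a for a in v) ** 0.5
        if norm > tol:                       # skip numerically dependent columns
            basis.append([a / norm for a in v])
    return basis

def subspace_distance(phi, G):
    """Sum over the columns of phi of the squared residual after projection onto span(G)."""
    basis = column_space_basis(G)
    rows, cols = len(phi), len(phi[0])
    total = 0.0
    for c in range(cols):
        v = [phi[r][c] for r in range(rows)]
        for u in basis:                      # remove the component along each basis vector
            dot = sum(a * b for a, b in zip(u, v))
            v = [a - dot * b for a, b in zip(v, u)]
        total += sum(a * a for a in v)
    return total
```

If $\varphi = GH$ holds exactly for some $H$, the distance is zero up to rounding, whatever the entries of $H$ are. That is the property that makes the comparison insensitive to the illumination-dependent matrix.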



4. Experimental Results

In this section we test the color covariance model and the developed color
correlation model in a texture recognition task. The image database consists of 20
textured images, shown in Fig. 6, which include both homogeneous and
inhomogeneous textures. For each image class in the database, we generated five
image samples under white, red, green, blue, and yellow illuminations using HANSA
color filters; the images were photographed with a Sony CCD camera. For each of
the five images under different illuminations, we generated five additional rotated
images at the rotation angles 30°, 60°, 90°, 120° and 150° with respect to the original
non-rotated image.

Thus for each class, we have a total of thirty images photographed at different
illuminations and rotations. The whole image database consists of 600 images: a
total of 20 classes with 30 images per class. For each image in the database, color
covariance and color correlation functions are estimated by averaging over a finite
image region of size 60x60 pixels and for a finite image lag, i.e., $C_{ij}(x, y)$ and/or
$R_{ij}(x, y)$ is estimated for $|x| \le 16$ and $|y| \le 16$. It has been suggested in [8] to
normalize color covariance functions against intensity changes by dividing by
$C_{11}(0, 0)$. We will include this normalization scheme in our tests to see whether it is
useful or not; $R_{11}(0, 0)$ will be used for normalizing correlation functions.

The test is divided into two stages: the training phase, for feature extraction from the
original image class, and the testing phase, which includes feature extraction from the image
under investigation, which has illumination and geometry changes with respect to the
original image class. In the training phase, after computing color covariances
and/or color correlations, Zernike moment invariant
matrices are computed for each image up to the 6th, 8th, 10th, and 12th orders for comparison. The
training process is performed on each of the 20 color textured images photographed
under white illumination without rotation, and all 20 Zernike moment
invariant matrices are stored to be used offline. In the testing phase, the unknown
textures are extracted from the remaining 580 images under different illuminations
and rotations. The distance function defined in (23) is used as a similarity measure.
The recognition performance is measured as the number of correct matches over the
total number of images. See Fig. 7 for a comparison.




Fig. 6. The original image database used in our experiments. From left to right, the first row
shows carpet, carpet back, ceiling tile, coffee grounded, and jungle. Second row: algae, ground
leaf, wallpaper, leaf, and cloth. Third row: cotton canvas, forest, forest cutting, fur, and water.
Fourth row: granite1, granite2, granite3, chrome, and wood.

Fig. 7. Performance comparison of using Zernike moment correlation matrices and Zernike
moment covariance matrices for the illumination-rotation invariant recognition of color texture.



The circular correlation functions proposed in this work give the highest
recognition performance, 97%. On the other hand, the recognition performance
of the covariance functions proposed by Kondepudy and Healey [7] is 85%, and
circular covariance functions give 87%. As we increase the order of the Zernike
moments, the recognition performance increases for all models. Another test,
performed with the intensity normalization, shows that
dividing each color covariance by $C_{11}(0, 0)$ and dividing each color correlation by
$R_{11}(0, 0)$ reduces the recognition performance to less than 80%.



5. Conclusions 

The spatial correlation functions introduced in this paper are very useful in representing
and modeling color texture. Compared to the previous color covariance functions, the
recognition performance is much higher. We also derived a complete set of Zernike
moment invariant correlation and covariance matrices to make correlation functions
invariant to rotation changes of textures. The recognition performance increases as
the moment order is increased. The work also investigates four texture modeling
functions: ordinary covariance, circular covariance, ordinary correlation, and circular
correlation. Using Zernike moments, the dimensionality of the correlation features is
reduced, and it may be useful to use other kinds of moments for the recognition of
texture since the derived invariants possess a general form.



Abstract. Most cost function based clustering or partitioning methods
measure the compactness of groups of data. In contrast to this picture 
of a point source in feature space, some data sources are spread out on 
a low-dimensional manifold which is embedded in a high dimensional 
data space. This property is adequately captured by the criterion of 
connectedness which is approximated by graph theoretic partitioning 
methods. 

We propose in this paper a pairwise clustering cost function with a novel 
dissimilarity measure emphasizing connectedness in feature space rather 
than compactness. The connectedness criterion considers two objects as 
similar if there exists a mediating intra cluster path without an edge 
with large cost. The cost function is optimized in a multi-scale fashion. 
This new path based clustering concept is applied to segment textured 
images with strong texture gradients based on dissimilarities between 
image patches. 



1 Introduction 

Partitioning a set of objects into groups and thus extracting the hidden structure 
of the data set is a very important problem which arises in many application 
areas e.g. pattern recognition, exploratory data analysis and computer vision. 
Intuitively, a good grouping solution is characterized by a high degree of homo- 
geneity of the respective clusters. Therefore, the notion of homogeneity must be 
given a mathematically precise meaning which strongly depends on the nature 
of the underlying data. 

In this paper we will deal with an important subclass of partitioning methods,
namely clustering according to pairwise comparisons between objects. Such
data is usually called proximity or (dis)similarity data. This data
modality is of particular interest in applications where object (dis)similarities
can be reliably estimated even when the objects are not elements of a metric
space. There is a rich variety of clustering approaches developed particularly
for this data modality in the literature. Most of them fall in the category of
agglomerative methods [13]. These methods share as a common trait that they
start grouping with a configuration composed of exactly one object per cluster
and then successively merge the two most similar clusters. Agglomerative
methods are almost always derived from algorithmic considerations rather than
on the basis of an optimization principle, which often obscures the underlying
modeling assumptions.

A systematic approach to pairwise clustering by objective functions as de- 
scribed in [17] is based on an axiomatization of invariance properties and robust- 
ness for data grouping. As a consequence of this axiomatic approach we restrict 
our discussion to intra cluster criteria. Our second important design decision 
for pairwise clustering replaces the pairwise object comparison by a path-based 
dissimilarity measure, thereby emphasizing cluster connectedness. The effective 
dissimilarity between objects is defined as the largest edge cost on the mini- 
mal intra cluster path connecting both objects in feature space. Two objects 
which are assigned to the same cluster are either similar or there exists a set of 
mediating objects such that two consecutive objects in this chain are similar. 

The distinction between compactness and connectedness principles is also 
addressed by two other recently proposed clustering methods. Tishby and Slonim 
[20] introduced a Markovian relaxation dynamics where the Markov transition 
probability is given by object dissimilarities. Iterating such a relaxation dynamics 
effectively connects objects by sums over minimal paths. The method, however, 
does not include the constraint that all considered paths have to be restricted 
to nodes from the same cluster. The other method which was introduced by 
Blatt, Wiseman and Domany [2] simulates the dynamics of a locally connected, 
diluted ferromagnet. The partial order at finite temperature is interpreted as a 
clustering solution in this model. 

2 Pairwise Data Clustering 

Notational Prerequisites: The goal of data clustering is formally given by the
partitioning of $n$ objects $o_i$, $1 \le i \le n$, into $k$ groups, such that some measure of
intra cluster homogeneity is maximized. The memberships of objects to groups
can be encoded by an $n \times k$ matrix $M \in \{0, 1\}^{n \times k}$. In this setting the entry $M_{i\nu}$
is set to 1 if and only if the $i$th object is assigned to cluster $\nu$, which implies
the condition $\sum_{\nu=1}^{k} M_{i\nu} = 1$, $i = 1 \ldots n$. The set of all assignment matrices
fulfilling this requirement is denoted in the following by $\mathcal{M}$.

The dissimilarity between two objects $o_i$ and $o_j$ is represented by $D_{ij}$. These
individual dissimilarity values are collected in a matrix $D \in \mathbb{R}^{n \times n}$. It is worth
noting here that many application domains frequently confront the data analyst
with data which violates the triangle inequality. Moreover, the self-dissimilarity
of objects often is non-vanishing, negative dissimilarities might even occur, or
a certain percentage of dissimilarities is unknown. For our proposed method,
we only require symmetry, i.e. $D_{ij} = D_{ji}$. To distinguish between known and
unknown dissimilarities, neighborhood sets $\mathcal{N}_1, \ldots, \mathcal{N}_n$ are introduced, i.e.,
$j \in \mathcal{N}_i$ denotes that the dissimilarity $D_{ij}$ is known.

Fig. 1. Prototypical situation for the standard pairwise clustering approach



The Objective Function: With these notational preliminaries, we are now
able to address the important modeling step of choosing an appropriate cost
function. An axiomatization of objective functions for data clustering based on
invariance and robustness criteria is given in [17]. This approach makes explicit
the intuitively evident properties that a global shift of the data or a rescaling
as well as contamination by noise should not sensitively influence the grouping
solution. In accordance with this approach we will focus on the following cost
function:

$$H^{pc}(M, D) = \sum_{i=1}^{n} \sum_{\nu=1}^{k} M_{i\nu} d_{i\nu}, \quad \text{where} \quad d_{i\nu} = \frac{\sum_{j \in \mathcal{N}_i} M_{j\nu} D_{ij}}{\sum_{j \in \mathcal{N}_i} M_{j\nu}}. \qquad (1)$$

$H^{pc}$ sums up individual contributions $d_{i\nu}$ for each object $o_i$ and each group $C_\nu$,
where $d_{i\nu}$ stands for the average dissimilarity between $o_i$ and objects belonging
to cluster $C_\nu$. $H^{pc}$ thus favors intra-cluster compactness. It is the use of this
normalization that removes the sensitive dependency of the minimum of $H^{pc}$ on
constant shifts of the dissimilarity values and makes it insensitive to different
cluster sizes.
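
The cost function can be evaluated directly from a label vector. The sketch below is an illustration under simplifying assumptions: all dissimilarities are taken as known (every neighborhood set contains all objects, including the object itself), assignments are stored as a list `labels` instead of the matrix $M$, and the function name `pairwise_cost` is hypothetical.

```python
def pairwise_cost(labels, D):
    """Pairwise clustering cost: each object contributes its average
    dissimilarity to the members of its own cluster.

    labels: labels[i] is the cluster index of object i.
    D: full dissimilarity matrix as a list of lists.
    """
    n = len(labels)
    cost = 0.0
    for i in range(n):
        # Only the cluster that object i belongs to has M_i,nu = 1.
        members = [j for j in range(n) if labels[j] == labels[i]]
        cost += sum(D[i][j] for j in members) / len(members)
    return cost
```

Because each object's contribution is an average, replacing $D$ by $cD + D_0$ changes the cost to $c$ times the old cost plus $nD_0$ for every assignment, so the ranking of assignments is unchanged for $c > 0$.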

Optimization: The optimization of objective functions like $H^{pc}$ is computationally
difficult since combinatorial optimization problems of this kind exhibit
numerous local minima. Furthermore, most data partitioning problems
are proven to be $\mathcal{NP}$-hard. For robust optimization, stochastic optimization
techniques like Simulated Annealing (SA) [11] or Deterministic Annealing (DA)
[19,7] have been shown to perform satisfactorily in many pattern recognition and
computer vision applications. Effectively, these annealing methods fall in the class
of homotopy methods with smoothing controlled by a temperature parameter. In
the zero temperature limit the comparatively fast local optimization algorithm
known as ICM is obtained [1].

Drawbacks of the Approach: So far we have discussed a very powerful approach
to the pairwise data clustering problem. It is theoretically well founded
and has been shown to be applicable in a wide range of data analysis problems, ranging
from texture segmentation [8] and document clustering [17] to structuring of
genome data bases [7]. An ideal situation for the cluster compactness criterion is
depicted in figure 1 for a toy data set. All objects have been assigned to clusters
in an intuitively correct manner.

Fig. 2. Data sets on which the naive intra cluster compactness criterion fails

However there exist situations, where the exclusive focus on compactness fails 
to capture essential properties of the data. Two prototypical examples are given 
in figure 2. It is clear that the human data analyst expects that the ring like 
structures or the spiral arms are detected as clusters. It is the novel contribution 
of this paper to propose a way of formalizing this goal by determining effective 
inter object dissimilarities while keeping the well approved clustering objective 
function. 

3 Path Based Clustering 

In order to motivate our novel modification of the basic pairwise clustering
method we will often appeal to the reader's geometric intuition despite the
fact that the objects do not necessarily reside in a metric space.

Modeling: As demonstrated before, the pairwise clustering approach according
to $H^{pc}$ is not well suited for the detection of elongated structures in the object
domain. This objective function sums all intra-cluster dissimilarities, whereby
similar objects will be grouped together regardless of their topological relations.
This is a good solution as long as the data of one group can be interpreted as a
scatter around a single centroid (see figure 1). But if the data is a scatter around a
curve (cf. figure 2) or a surface, pairwise clustering as defined by $H^{pc}$ will fail.
In this case the effective dissimilarity of two objects should be mediated by peer
objects along that curve, no matter how large the extent of that structure may
be.

Assume that the objects $o_i$ and $o_j$ belong to the same manifold. Then with
high probability there exists a path from $o_i$ to $o_j$ over other objects of this
manifold such that the dissimilarities of two consecutive objects are small. The
reason for this is that the density of objects coming from one source is expected
to be comparatively high. On the other hand, if $o_i$ and $o_j$ belong to different,
non-intersecting manifolds, then all paths from $o_i$ to $o_j$ have with high probability
at least one pair of consecutive objects with high dissimilarity.

As clusters are defined by their coherence, one is only interested in paths
through the object domain which traverse regions of high density. The part of
a path where the density is lowest should thus determine the overall cost of
this path. In other words, the maximum dissimilarity along a path determines
its cost. In order to formalize our ideas about mediating the dissimilarity of any
two objects by peers in the same cluster, we define the effective dissimilarity
$D^{\mathrm{eff}}_{ij}$ between objects $o_i$ and $o_j$ to be the length of the minimal connecting path:

$$D^{\mathrm{eff}}_{ij}(M, D) = \min_{p \in \mathcal{P}_{ij}(M)} \left\{ \max_{1 \le h \le |p| - 1} D_{p[h]p[h+1]} \right\}, \qquad (2)$$

where

$$\mathcal{P}_{ij}(M) = \left\{ p \in \{1, \ldots, n\}^{l} \;\middle|\; \exists \nu : \prod_{h=1}^{l} M_{p[h]\nu} = 1 \,\wedge\, l \le n \,\wedge\, p[1] = i \,\wedge\, p[l] = j \right\}$$

is the set of all paths from $o_i$ to $o_j$ through cluster $\nu$ if $o_i$ and $o_j$ belong to cluster
$\nu$. If both objects belong to different clusters, $\mathcal{P}_{ij}(M)$ is the empty set and the
effective dissimilarity is not defined.
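
For a single cluster, the effective dissimilarity between all pairs of members can be computed with a Floyd-style recursion in which the usual "sum of edge costs" is replaced by "largest edge cost". The sketch below assumes a full symmetric matrix `D` given as a list of lists and a list `members` of the object indices in the cluster; the function name `effective_dissimilarity` and the choice to initialize the diagonal with the self-dissimilarity are assumptions made for illustration.

```python
def effective_dissimilarity(D, members):
    """Minimax path cost between all pairs of objects in one cluster.

    Returns a dict {(i, j): cost}, where cost is the smallest possible value of
    the largest edge on any path from i to j that visits only cluster members.
    """
    # Start from the direct edges between members.
    eff = {(i, j): D[i][j] for i in members for j in members}
    for m in members:                       # allow m as an intermediate object
        for i in members:
            for j in members:
                via = max(eff[(i, m)], eff[(m, j)])
                if via < eff[(i, j)]:
                    eff[(i, j)] = via
    return eff
```

For four points on a line at positions 0, 1, 2, 3 with $D_{ij} = |i - j|$, the effective dissimilarity between the two end points is 1 when all four points are in the cluster, but 3 when the two middle points are excluded, because then no mediating objects remain.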

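With the new definition of the effective dissimilarity we are able to define
the objective function for path based clustering. It has the same functional form
as for pairwise clustering:

$$H^{pbc}(M, D) = \sum_{i=1}^{n} \sum_{\nu=1}^{k} M_{i\nu} d^{\mathrm{eff}}_{i\nu}(M, D), \quad \text{where} \quad d^{\mathrm{eff}}_{i\nu}(M, D) = \frac{\sum_{j=1}^{n} M_{j\nu} D^{\mathrm{eff}}_{ij}(M, D)}{\sum_{j=1}^{n} M_{j\nu}}. \qquad (3)$$

Thereby the desirable properties of shift and scale invariance of the pairwise
clustering cost function are conserved:

$$\forall c > 0, \; D_0 \in \mathbb{R} : \quad \arg\min_{M \in \mathcal{M}} H^{pbc}(M, D) = \arg\min_{M \in \mathcal{M}} H^{pbc}(M, cD + D_0), \qquad (4)$$

since for all $M \in \mathcal{M}$, $c > 0$, and $D_0 \in \mathbb{R}$ it holds that $H^{pbc}(M, cD + D_0) = cH^{pbc}(M, D) + nD_0$.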



Optimization by Iterated Conditional Mode (ICM): Finding the minimum
of $H^{pbc}(M, D)$ has a high computational complexity. Many different
methods are known that avoid a complete search in the assignment configuration
space ($|\mathcal{M}| = k^{n}$). A very effective and simple method is called iterated conditional
mode [1]. ICM assigns an object to a cluster under the condition that all
other assignments are kept fixed. In algorithm 1 the function $s_{\pi[i]}(M, e_{\nu})$ changes
the $\pi[i]$-th row of the assignment matrix $M$ by replacing it with the $\nu$th unit
vector. This so-called single site update is iterated over all objects in a random
manner. A complete cycle of visiting all sites is called a sweep. As a common site
visitation schedule, an arbitrary permutation $\pi$ of all objects is generated before
each sweep in order to avoid local minima due to a fixed visitation order. The







Bernd Fischer, Thomas Zoller, and Joachim M. Buhmann 



Algorithm 1 Iterated conditional mode (ICM) for Path Based Clustering
Require: dissimilarity matrix $D$
    number of objects $n$
    number of clusters $k$
Ensure: $\arg\min_{M \in \mathcal{M}} H^{pbc}(M, D)$ with high probability
choose $M$ randomly
repeat
    $\pi = \mathrm{perm}(\{1, \ldots, n\})$
    for all $i \in \{1, \ldots, n\}$ do
        $\nu^{*} = \arg\min_{\nu \in \{1, \ldots, k\}} H^{pbc}(s_{\pi[i]}(M, e_{\nu}), D)$
            where $s_{\pi[i]}(M, e_{\nu})$ assigns object $\pi[i]$ to cluster $\nu$
        $M = s_{\pi[i]}(M, e_{\nu^{*}})$
    end for
until converged
return $M$



algorithm repeats the sweeps until convergence is reached, i.e. until no
assignment changes. During each update step the object is assigned to the
cluster for which the cost of the resulting configuration is minimal. Therefore
the cost, or energy, cannot increase in any sweep, and ICM will terminate
after a finite number of sweeps.
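
The sweep structure of Algorithm 1 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `icm` and `pairwise_cost` are hypothetical, assignments are stored as a label list rather than an assignment matrix, and the plain pairwise clustering cost on the input dissimilarities stands in for the path based cost, which would need the effective dissimilarities discussed in the following paragraphs.

```python
import random


def pairwise_cost(labels, D, k):
    """Stand-in cost: for each cluster, the sum of within-cluster
    dissimilarities divided by the cluster size (assumed form)."""
    cost = 0.0
    for v in range(k):
        members = [i for i, c in enumerate(labels) if c == v]
        if members:
            cost += sum(D[i][j] for i in members for j in members) / len(members)
    return cost


def icm(D, k, cost=pairwise_cost, seed=0, max_sweeps=100):
    """Iterated conditional mode with a fresh random visitation order per sweep."""
    rng = random.Random(seed)
    n = len(D)
    labels = [rng.randrange(k) for _ in range(n)]  # choose M randomly
    for _ in range(max_sweeps):
        changed = False
        order = list(range(n))
        rng.shuffle(order)  # permutation pi for this sweep
        for i in order:
            old = labels[i]
            best_v, best_cost = old, cost(labels, D, k)
            for v in range(k):  # hypothetically assign object i to each cluster
                if v == old:
                    continue
                labels[i] = v
                c = cost(labels, D, k)
                if c < best_cost:  # accept only strict improvements
                    best_v, best_cost = v, c
            labels[i] = best_v
            changed = changed or best_v != old
        if not changed:  # converged: a full sweep changed nothing
            break
    return labels
```

Because only strict improvements are accepted, the result is a configuration that no single site update can improve, which is a local and not necessarily a global minimum. Recomputing the full cost for every hypothetical assignment is deliberately naive here.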

Critical for the running time is the update step. Here the recalculations
of the dissimilarity matrix dominate the complexity because this
computational effort is necessary during each assignment update. Basically, what
has to be solved is an ALL-PAIRS-SHORTEST-PATH problem for each group of
objects. For a full graph the algorithm of Floyd has a running time of $O(n^3)$
[3]. If object $o_i$ is updated, the ICM algorithm tries to find the minimum of the
local costs by hypothetically assigning the given object to the various groups.
Therefore, the effective dissimilarity to each cluster has to be determined once
with object $o_i$ inside and once with $o_i$ outside the cluster. Thus $2k$ different
effective dissimilarity matrices are needed. In the next paragraph an efficient
implementation of this update step is presented.

Efficient Implementation of Update Step: One observes that the instances
of the ALL-PAIRS-SHORTEST-PATH problems in two consecutive update steps are
almost the same. They differ only in one single object: the object which is
to be updated. So it makes sense to check whether one effective dissimilarity
matrix can be used as a starting point for the calculation of an effective
dissimilarity matrix in the next update step. Fortunately, those $k$ dissimilarity
matrices which correspond to the current configuration are used again in the next
step. What about the other $k$ matrices?

For $k - 1$ of them a possible new configuration is given by adding object $o_i$,
whereas for one cluster the new costs without $o_i$ have to be computed. Consider
the first case: object $o_i$ is to be inserted in a certain group. For reasons of
computational efficiency a complete recalculation of the effective dissimilarity
matrix of that cluster is to be avoided. A closer look at Floyd's algorithm leads

Fig. 3. The graphs show the predecessor graph of object $o_s$ before (a) and after (b)
object $o_i$ is removed from the cluster. For those objects whose path to $o_s$ does not
lead over $o_i$, the effective dissimilarity does not change

the way: Its first step is the initialization of the matrix with the direct
distances between the objects, which in our case are given by the input dissimilarity
matrix. The goal is to get the shortest path distance between each pair of objects.
For any object, Floyd's algorithm tries to put it on a path between each pair of
objects. If now $o_i$ is put into a new cluster, only the last iteration of the Floyd
algorithm has to be performed, in which the current object is put on each of the
existing paths between entity pairs. This step has a running time of $O(n^2)$. In
order to do so one has to get effective dissimilarities between $o_i$ and all other
objects in the considered cluster. Because of the symmetry of the original input
dissimilarity matrix the effective dissimilarities are again symmetric. For that
reason it is sufficient to solve one SINGLE-SOURCE-SHORTEST-PATH problem in
order to arrive at the effective dissimilarities between $o_i$ and all other objects.
This can be solved with Dijkstra's algorithm in a running time of $O(n^2 \log n)$
[3]. So far we can compute the update step with an overall running time of
$O(n^2 \log n)$.
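
The insertion case described above amounts to a single Floyd-style relaxation pass over all pairs. The sketch below is illustrative only: `insert_object` is a hypothetical name, `E` is assumed to hold the current all-pairs path costs of the cluster, and `d` is assumed to hold the already computed path costs between the new object and every member (the single-source step). The rule for combining two path segments is left as a parameter, with the additive rule of the classical Floyd algorithm as default, since the path cost of the criterion itself is defined earlier in the paper.

```python
def insert_object(E, d, combine=lambda a, b: a + b):
    """Extend an all-pairs path cost matrix by one new object in O(n^2).

    E       -- n x n list of lists, path costs among current cluster members
    d       -- length-n list, path cost between the new object and each member
    combine -- cost of a path made of two segments (assumed additive by default)
    """
    n = len(E)
    new = []
    for s in range(n):
        # Last Floyd iteration: try routing every pair (s, t) over the new object.
        row = [min(E[s][t], combine(d[s], d[t])) for t in range(n)]
        row.append(d[s])  # cost between member s and the new object
        new.append(row)
    new.append(list(d) + [0.0])  # row of the new object itself
    return new
```

For instance, with two members at direct cost 5 and a new object at cost 1 from each, the additive rule lowers their path cost to 2, while passing `combine=max` (a bottleneck rule) lowers it to 1.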

It remains to describe the necessary recalculations in the case where $o_i$ is to
be removed from the cluster it has been assigned to in the previous configuration.
Again this goal has to be reached with the least possible computational
effort. However, in this situation the asymptotic running time of $O(n^3)$ for a
complete recalculation of dissimilarities for the whole group of objects cannot
be decreased. Nevertheless there is a good heuristic for lowering the running
time. If there exists a shortest path from $o_s$ to $o_t$ which does not lead over $o_i$,
the effective dissimilarity $D_{st}$ will not change if $o_i$ is removed from the cluster.
One can obtain a predecessor matrix for all objects in a given cluster in the
same running time as it takes to compute the effective dissimilarities. For each
object we can thus determine the shortest path tree to all other objects in a
running time of $O(n)$. If we have the shortest path tree from $o_s$ then only those
dissimilarities between $o_s$ and another object $o_t$ have to be updated where $t$ is
an index out of the set of all objects in the subtree with root $o_i$ (see figure 3).
The total running time for an assignment update step is thus $O(n^3)$.

Multi-scale Optimization: As can be seen from the previous section the proposed
optimization scheme is computationally highly demanding. An improvement
in performance as well as solution quality is reached by using multi-scale
techniques. The idea of multi-scale optimization is to lower the complexity by
decreasing the number of considered entities in the object space. For example in
image processing one can compute the given optimization tasks on a pyramid of
different resolution levels of the image. At the first stage of such a procedure one
solves the given computational problem on the subsampled image with lowest
resolution, maps the result to the next finer resolution level and starts the
optimization again. The expectation is that the result from the coarser level provides
a reasonable starting point for the next finer level in the sense that convergence
to a good solution can be reached within a few iterations. The probability of
obtaining the global minimum is raised even if a local optimization scheme is
used and the running time will dramatically decrease. A general overview of
multi-scale optimization and its mathematically rigorous description is given in
[15]. In order to pursue this approach a proper initialization of the coarse levels
is needed. To this end three kinds of mappings have to be defined:

First of all, a function $I^l$ is needed, which maps the objects from the finer
level $l$ to the next coarser one (level $l+1$). The multi-scale operator $I^l$ is defined
as a function

$$I^l : O^l \to O^{l+1} \qquad (5)$$

where $O^l$ is the set of objects on the $l$-th resolution level and $O^0$ is the set
of objects on the finest level. So far this is just a formal definition. There is
indeed a large design freedom in choosing the concrete form of this mapping. In
the general case subsuming highly similar objects is reasonable. For our most
prominent application task, texture segmentation, however, we can use the natural
topology of square neighborhoods of image sites in order to determine the fine
to coarse mapping.

Second, the input dissimilarity matrix for the coarser level has to be defined 
in such a way that the basic modeling assumptions which underlie the objective
function are not violated. If the objects which map to one single entity in the 
coarser level belong to different clusters, the newly formed super-object in the 
coarser level should belong to one of these. Otherwise two completely different 
groups will get closer in the coarser level and the structure of the data is lost. 

For that reason a mapping between each object in the coarse level and a
representative object in the next finer level has to be defined as the third
ingredient of the multi-scale approach. The function $R^l$ will denote the representative
object in level $l$ for each object in level $l+1$:

$$R^l : O^{l+1} \to O^l \quad \text{with} \quad R^l(o_i) \in \{o_j \mid I^l(o_j) = o_i\} \qquad (6)$$

One possibility of defining such a representative is to choose the object nearest
to the center of mass of the set. We are now able to define the input dissimilarity
matrix on the $(l+1)$-th level:

$$D^{l+1}(i, j) = D^l(R^l(o_i), R^l(o_j)) \qquad (7)$$

Fig. 4. Illustration of the multi-scale coarsening operators: I is mapping the objects
on the finer level to the corresponding super-object in the coarser level, R is
back-projecting to the representative in the finer level in order to enable the computation
of the coarse level dissimilarity matrix



Having assembled all the necessary parts which constitute the resolution
pyramid, the optimization can proceed as described in the beginning of this
section. The intermediate coarse to fine mapping of computational results is
achieved in a straightforward manner. A pictorial description of the multi-scale
pyramid construction is given in figure 4.
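
One coarsening step of the pyramid can be sketched as below. This is an illustration under stated assumptions rather than the authors' code: `coarsen` is a hypothetical name, the fine to coarse mapping is passed as an integer array `groups` with one super-object index per fine object (indices 0, 1, ... with no gaps), and the representative of each super-object is taken to be the group medoid, used here as a stand-in for "the object nearest to the center of mass" because only dissimilarities are available.

```python
import numpy as np


def coarsen(D, groups):
    """Build the coarse-level dissimilarity matrix from representatives.

    D      -- (n, n) fine-level dissimilarity matrix
    groups -- length-n integer array, groups[j] = super-object of fine object j
    Returns (D_coarse, rep), where rep[g] is the fine-level index chosen as the
    representative of super-object g.
    """
    D = np.asarray(D, dtype=float)
    groups = np.asarray(groups)
    m = int(groups.max()) + 1
    rep = np.empty(m, dtype=int)
    for g in range(m):
        members = np.flatnonzero(groups == g)
        sub = D[np.ix_(members, members)]
        # Medoid: member with the smallest total dissimilarity to its group.
        rep[g] = members[sub.sum(axis=1).argmin()]
    # Coarse dissimilarities are read off between the representatives.
    return D[np.ix_(rep, rep)], rep
```

Under the same assumptions, a labelling found on the coarse level can be carried back to the finer level with `fine_labels = coarse_labels[groups]`, which then serves as the starting configuration for the next optimization run.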

4 Texture Segmentation by Pairwise Data Clustering 

In order to pose the segmentation of images according to texture content as a 
pairwise data clustering problem, we follow the approach of Hofmann et al. [8]. A 
suitable image representation for texture segmentation is given by a multi-scale 
family of Gabor filters 

which are known to have good discriminatory power for a wide range of textures 
[8] [10]. In good agreement with psychophysical experiments [4] these Gabor 
filters extract feature information in the spatial correlation of the texture. The 
moduli of a bank of such filters at three different scales with octave spacing 
and four orientations are used as the representation of the input image. Thus
the resulting twelve-dimensional vector of modulus values $f(x)$ for each image
location $x$ comprises our basic texture features.

By its very nature, texture is a non-local property. Although $f(x)$ contains
information about spatial relations of pixels in the neighborhood of $x$, it may not

Fig. 5. Results of the novel path based clustering criterion on three toy data sets: a)
arcs b) circular structure c) spiral arms



suffice to grasp the complete characterization of the prevalent texture. Therefore
the image is covered with a regular grid of image sites. Suppose a suitable binning
$t_0 = 0 < t_1 < \dots < t_L$ is given. For each site $i$, $i = 1 \dots N$, the empirical feature
distribution $f_i^{(r)}$ is computed for all Gabor channels $r$ in order to arrive at a
texture description for these spatially extended image patches. The dissimilarity
between textures at locations $i$ and $j$ is then computed independently for each of
the channels by a $\chi^2$ statistic:

$$D_{ij}^{(r)} = \chi^2 = \sum_{k=1}^{L} \frac{\left[f_i^{(r)}(t_k) - \hat{f}^{(r)}(t_k)\right]^2}{\hat{f}^{(r)}(t_k)}, \quad \text{where } \hat{f}^{(r)}(t_k) = \left[f_i^{(r)}(t_k) + f_j^{(r)}(t_k)\right]/2. \qquad (9)$$

In order to combine the different values $D_{ij}^{(r)}$ into one dissimilarity value for each
pair of objects the $L_1$-norm is used: $D_{ij} = \sum_r |D_{ij}^{(r)}|$. As a comparative study
shows, this norm outperforms all other norms in the $L_p$ family [18].
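
The per-channel comparison and its combination can be sketched as follows. The sketch assumes that the statistic is the χ² distance of each histogram to the mean histogram of the pair, and that the channel values are combined by a plain sum (their $L_1$-norm, since each value is non-negative). The function names are hypothetical, and empty bins are skipped to avoid division by zero, which is a choice made here and not stated in the text.

```python
import numpy as np


def chi2_dissimilarity(f_i, f_j):
    """Chi-square statistic between two binned feature distributions."""
    f_i = np.asarray(f_i, dtype=float)
    f_j = np.asarray(f_j, dtype=float)
    mean = (f_i + f_j) / 2.0
    mask = mean > 0  # skip bins that are empty in both histograms
    return float(np.sum((f_i[mask] - mean[mask]) ** 2 / mean[mask]))


def texture_dissimilarity(hists_i, hists_j):
    """Combine per-channel chi-square values by their L1 norm.

    hists_i, hists_j -- sequences of shape (channels, bins), one histogram
    per Gabor channel for each of the two image sites.
    """
    return sum(chi2_dissimilarity(a, b) for a, b in zip(hists_i, hists_j))
```

Two sites with identical histograms in every channel get dissimilarity zero, and channels in which the sites agree contribute nothing to the sum.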

5 Results 

Artificial Data Sets: In section 2 some drawbacks of the formerly developed 
clustering method were addressed. As the novel path based approach was es- 
pecially designed to cure such deficits, we first demonstrate its performance on 
some artificially generated data sets which would pose challenging problems for 
the conventional method. In figure 5 the results for three different toy data sets 
are depicted. Evidently the elongated structures apparent in these data sets are 
grouped in an intuitively appealing manner by our new algorithm. 

Recently another interesting paradigm for pairwise data clustering has been 
proposed by Fred [5]. In this agglomerative procedure clusters are combined 
if the dissimilarity increment between neighboring objects is sufficiently small 
according to some smoothness requirement. Our and Fred’s results are depicted 
in figure 6. In contrast to the approach in [5] the outer ring like structure of 

Fig. 6. Comparison between path based pairwise clustering (a) and the most compet- 
itive agglomerative procedure (b) 




Fig. 7. Multidimensional scaling visualization of texture histograms illustrating the 
texture drift phenomenon: a) frontal b) tilted view on five textures 



this data set does not constitute a cluster in the sense of our objective function 
and we have, therefore, inferred a solution with seven groups. Apart from this 
deviation, our method is able to match the competitor's performance.

Texture Segmentation: A core application domain for pairwise data cluster- 
ing techniques is the unsupervised segmentation of textured images. In order to 
obtain test images with known ground truth a set of mixtures of five textures 
each, so called Mondrians, has been constructed on the basis of micro textures 
from the Brodatz album. 

Before we come to discuss our results a word about the motivation for path 
based clustering for texture segmentation is in order. Textures in real world im- 
ages often exhibit a gradient or drift of the describing features due to perspective 
distortions. This lack of translational and rotational invariance has been recog- 

Fig. 8. Segmentation results for texture Mondrians a) input image b) ACM c) PWC 
d) Path Based PWC 



nized early by Gibson [6]. To our knowledge, Lee et al. [14] were the first to 
address this problem as an important factor in designing models for texture 
segmentation. The issue is illustrated by figure 7. Here the texture histograms 
have been treated as vectors. In order to visualize their structural properties the 
dimensionality reduction technique multi-dimensional scaling (MDS) has been 
applied to construct a low dimensional embedding of the given vectors while 
faithfully preserving the inter vector distances. Figure 7 has been generated by 
using a deterministic annealing implementation of MDS as described in [12]. The 
left figure shows the case of a frontal view on five exemplary textures whereas 
the right one depicts the histogram embedding of the same textures when tilting 
the viewing angle. Clearly, the distorted textures form elongated structures in 
the feature space, whereas the non-inclined ones are characterized by compact 
groups. 

In order to give an impression of the performance of the novel path based 
clustering approach three typical grouping solutions on mixtures of tilted Bro- 
datz textures are shown in figure 8. For comparison the results of the conven- 
tional pairwise clustering approach (PWC) and another recently proposed and 
highly competitive histogram clustering method known as Asymmetric Clus- 
tering Model (ACM) [16] are also shown. All of these results were reached by 
multi-scale techniques. In this context it is interesting to shed some light on the 

Fig. 9. Segmentation results for the texture Mondrian from example 8(a) with Minimum
Spanning Tree Clustering using a) 15 clusters b) 26 c) 42 d) 145 and e) 210 clusters



different topologies in the object and spatial domain. Whereas the novel clustering
algorithm presumes a certain topological structure in the object realm,
namely that of elongated data sources, the spatial relations of two-dimensional 
images yield another interesting starting point in terms of the multi-scale coars- 
ening strategy. Evidently spatially neighboring sites on the image grid are likely 
to belong to the same texture category. Therefore combining four by four regions 
of image sites can be used for defining the multi-scale pyramid in the case of 
texture segmentation. 

As can be seen in figure 8, path based pairwise data clustering clearly outper- 
forms its competitors on this testbed. The results show that the other algorithms 
tend to split perceptually homogenous regions due to the fact that they cannot 
handle texture gradients properly. Moreover the image regions in which an edge 
between adjacent textures occurs are notoriously difficult to handle. The mixed 
statistics of such image parts do not pose so much of a difficulty for our novel 
algorithm because often there are links in terms of mediating paths to textures 
on either side of the edge. Thus the region of concern will then be adjoined to 
one of these instead of being considered as a group in its own right. However in 
some cases (cf. the second row in figure 8) even the novel method fails to achieve
the expected results. Thus the problem of mixed statistics cannot be considered
completely resolved. Another interesting example is given by the last row
of figure 8. Here the same texture has been used twice, once in the upper part
and once on the right side of the Mondrian. Our method groups these two re- 
gions together thereby recognizing the textural similarity whereas the competing 
approaches separate this texture in different segments. 

Path Based Clustering is related, but not identical to the agglomerative min- 
imum spanning tree clustering algorithm (MST) (c.f. [9]). If outliers are present 
far away from all clusters, MST will put them in single clusters, whereas PBC 
assigns them to the nearest cluster. Figure 9 shows some results of MST applied 
to texture segmentation. The agglomerative algorithm has been stopped at dif- 
ferent levels. The result with 26 clusters, for instance, contains only 3 groups 
with more than 3 elements. The result with 145 clusters is the first to distin- 
guish the 5 different texture segments. However this solution suffers from a large 
amount of noise near the texture boundaries. 

Fig. 10. Visualization of frequencies with which a given edge is lying on an optimal 
path between some objects, a) example one, b) example three of figure 8 



Another interesting insight into our novel approach is given by looking at the
frequencies with which an edge between objects is lying on an optimal path 
between any two other objects. Such a visualization for the first and last example 
of figure 8 is given in figure 10. Here the objects are given by the image sites 
lying on a homogeneous grid. The darker the depicted edge, the more often this 
particular link has appeared on an optimal path. In the case of well separated 
groups (example 1 of figure 8) the visible edges are all in the interior of the five 
segments. On the other hand there is a number of frequently used edges forming 
a chain which traverses the border between Mondrian segments in the case of 
the merged textures (example three in figure 8). 

Apart from artificially created scenarios, the ultimate test of a texture segmentation
method is its performance on real-world images. In this case texture drift occurs as
a natural consequence of perspective distortion. Here some results on photo- 
graphic images taken from the COREL photo gallery are shown in figure 11. 
Again the path based approach to pairwise data grouping performs best. Fur- 
thermore the problem of the competing algorithms with image regions on texture 
edges becomes apparent again. Our novel method yields satisfactory results not 
introducing mixed statistics groups. 

6 Conclusion 

In this contribution a novel algorithm for pairwise data clustering has been pro- 
posed. It enhances the conventional approach by redefining inter object dissim- 
ilarities on the basis of mediating paths between those entities which belong to 
the same group or cluster. Moreover an efficient multi-scale optimization scheme 
for the new clustering approach has been developed. The ability of path based 



Fig. 11. Segmentation results for real world images: a) input image b) ACM c) PWC 
d) Path Based PWC 



pairwise data clustering to generate high quality grouping solutions has been 
demonstrated on artificially created data sets as well as for real world applica- 
tions. 

However we consider the new grouping criterion to be work in progress. First 
of all, the current technique for reducing the number of actually considered 
dissimilarity values is given by simple regular subsampling. Better alternatives in 
the sense of Gibbs sampling or methods based on the histogram of dissimilarities 
should be developed. Furthermore, better stochastic optimization methods for 
our novel clustering approach like simulated annealing have to be formulated 
and embedded in the multi-scale framework. 

Abstract. This paper presents an iterative maximum likelihood framework
for perceptual grouping. We pose the problem of perceptual grouping
as one of pairwise relational clustering. The method is quite generic
and can be applied to a number of problems including region segmen- 
tation and line-linking. The task is to assign image tokens to clusters 
in which there is strong relational affinity between token pairs. The pa- 
rameters of our model are the cluster memberships and the link weights 
between pairs of tokens. Commencing from a simple probability distri- 
bution for these parameters, we show how they may be estimated using 
an EM-like algorithm. The cluster memberships are estimated using an 
eigendecomposition method. Once the cluster memberships are to hand, 
then the updated link-weights are the expected values of their pairwise 
products. The new method is demonstrated on region segmentation and 
line-segment grouping problems where it is shown to outperform a non- 
iterative eigenclustering method. 



1 Introduction 

Recently, there has been considerable interest in the use of matrix factorisation 
methods for perceptual grouping. These methods can be viewed as drawing their 
inspiration from spectral graph theory [4]. The basic idea is to commence from 
an initial characterisation of the perceptual affinity of different image tokens in 
terms of a matrix of link-weights. Once this matrix is to hand then its eigenvalues
and eigenvectors are located. The eigenmodes represent pairwise relational
clusters which can be used to group the raw perceptual entities together. There 
are several examples of this approach described in the literature. At the level of 
image segmentation, several authors have used algorithms based on the eigen- 
modes of an affinity matrix to iteratively segment image data. One of the best 
known is the normalised cut method of Shi and Malik [16]. Recently, Weiss [17] 
has shown how this, and other closely related methods, can be improved using 
a normalised affinity matrix. At a higher level, both Sarkar and Boyer [14] and
Perona and Freeman [13] have developed matrix factorisation methods for line- 
segment grouping. These non-iterative methods both use the eigenstructure of 
a perceptual affinity matrix to find disjoint subgraphs that represent the main 
arrangements of segmental entities. 

Although elegant by virtue of their use of matrix factorisation to solve the
underlying optimization problem, one of the criticisms which can be leveled at 
these methods is that their foundations are not statistical in nature. The aim in 
this paper is to overcome this shortcoming by developing a maximum likelihood 
framework for perceptual grouping. We pose the problem as one of pairwise 
clustering which is parameterised using two sets of indicator variables. The first 
of these are cluster membership variables which indicate to which perceptual 
cluster a segmental entity belongs. The second set of variables are link weights 
which convey the strength of the perceptual relations between pairs of nodes 
in the same cluster. We use these parameters to develop a probabilistic model 
which represents the pairwise clustering of the perceptual entities. We iteratively 
maximise the likelihood of the configuration of pairwise clusters using an EM- 
like algorithm. By casting the log-likelihood function into a matrix setting, we 
are able to estimate the cluster-memberships using matrix factorisation. Once 
these memberships are to hand, then the link-weights may be estimated. 

It is important to stress that although there have been some attempts at 
using probabilistic methods for grouping elsewhere in the literature [11], our 
method has a number of unique features which distinguish it from these alter- 
natives. Early work by Dickson [6] has used Bayes nets to develop a hierarchical 
framework for splitting and merging groups of lines. Cox, Rehg and Hingorani [9] 
have developed a grouping method which combines evidence from the raw edge 
attributes delivered by the Canny edge detector. Leite and Hancock [10] have 
pursued similar objectives with the aim of fitting cubic splines to the output of
a bank of multiscale derivative of Gaussian filters using the EM algorithm. Cas- 
tano and Hutchinson [11] have developed a Bayesian framework for combining 
evidence for different graph-based partitions or groupings of line-segments. The 
method exploits bilateral symmetries. It is based on a frequentist approach over 
the set of partitions of the line-segments and is hence free of parameters. Re- 
cently, Crevier [5] has developed an evidence combining framework for extracting 
chains of colinear line-segments. Our work differs from this work in a number of 
important ways. We use a probabilistic characterisation of the grouping-graph 
based on a matrix of link weights. The goal of computation is to iteratively re- 
cover the maximum likelihood elements of this matrix using the apparatus of 
the EM algorithm. When posed in this way, the resulting iterative process may 
also be regarded as the high-level analogue of a number of low-level iterative 
processes for perceptual grouping. Here several authors have explored the use 
of iterative relaxation style operators for edgel grouping. This approach was pi- 
oneered by Shashua and Ullman [15] and later refined by Guy and Medioni [7] 
among others. Parent and Zucker have shown how co-circularity can be used 
to gauge the compatibility of neighbouring edges [12]. Our method differs from 
these methods by virtue of the fact that it uses a statistical framework rather 
than a goal directed one dictated by considerations from neurobiology. 

The outline of this paper is as follows. Section 2 reviews previous work on 
how matrix factorisation may be applied to the link-weight matrix to perform 
perceptual grouping. In Section 3 we develop our maximum likelihood framework 
and show how the parameters of the model, namely the cluster membership
probabilities and the pairwise link-weights can be estimated using an iterative 
EM-like algorithm. In Section 4 we describe how the method can be applied 
to image segmentation while Section 5 describes a second application involving 
motion analysis. Finally, Section 6 concludes the paper by summarising our 
contributions and offering directions for future research. 



2 Grouping by Matrix Factorisation 

We pose the problem of perceptual grouping as that of finding the pairwise 
clusters which exist within a set of image tokens. These objects may be pixels, 
or segmental entities such as corners, lines, curves or regions. However, in this 
paper we focus on the two problems of region segmentation and motion analysis. 
The process of pairwise clustering is somewhat different to the more familiar 
one of central clustering. Whereas central clustering aims to characterise cluster- 
membership using the cluster mean and variance, in pairwise clustering it is link- 
weights between nodes which are used to establish cluster membership. Although 
less well studied than central clustering, there has recently been renewed interest 
in pairwise clustering aimed at placing the method on a more principled footing 
using techniques such as mean-field annealing [8].

To commence, we require some formalism. We are interested in grouping
a set of objects which are abstracted using a weighted graph. The problem
is characterised by the set of nodes $V$ that represent the objects and the set
of weighted edges $E$ between the nodes that represent the state of perceptual
grouping. The aim in grouping is to locate the set of edges that partition the
node-set $V$ into disjoint and disconnected subsets. If $V_\omega$ represents one of these
subsets and $\Omega$ is the index-set of different partitions, then $V = \bigcup_{\omega \in \Omega} V_\omega$ and
$V_{\omega'} \cap V_{\omega''} = \emptyset$ if $\omega' \neq \omega''$. Moreover, since the edges partition the node-set
into disconnected subgraphs, $E \cap (V_{\omega'} \times V_{\omega''}) = \emptyset$ if $\omega' \neq \omega''$. We are
interested in perceptual grouping problems which can be characterised using
a $|V| \times |V|$ matrix of link-weights $A$. The elements of this matrix convey the
following meaning in the hard limit

$$A_{i,j} = \begin{cases} 1 & \text{if there exists a partition } \omega \in \Omega \text{ such that } i \in V_\omega \text{ and } j \in V_\omega \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$

In this paper we are interested in how matrix factorisation methods can be used
to locate the set of edges which partition the nodes. One way of viewing this is as
the search for the permutation matrix which re-orders the elements of $A$ into
non-overlapping blocks. When the elements of the matrix $A$ are not binary
in nature, this is not a straightforward task. However, Sarkar and Boyer
[14] have shown how the positive eigenvectors of the matrix of link-weights can
be used to assign nodes to perceptual clusters. Using the Rayleigh-Ritz theorem,
they observe that the scalar quantity $x^T A x$, where $A$ is the weighted adjacency
matrix and $x$ has unit norm, is maximised when $x$ is the leading eigenvector of $A$.
Moreover, each of the subdominant eigenvectors corresponds to a disjoint perceptual
cluster. We confine our attention to the same-sign positive eigenvectors (i.e. those
whose corresponding eigenvalues are real and positive, and whose components are
either all positive or all negative in sign). If a component of a positive eigenvector
is non-zero, then the corresponding node belongs to the perceptual cluster
associated with the corresponding eigenmode of the weighted adjacency matrix. The
eigenvalues $\lambda_1, \lambda_2, \dots$ of $A$ are the solutions of the equation $|A - \lambda I| = 0$ where
$I$ is the $N \times N$ identity matrix. The corresponding eigenvectors $x_{\lambda_1}, x_{\lambda_2}, \dots$ are
found by solving the equation $A x_{\lambda_\omega} = \lambda_\omega x_{\lambda_\omega}$. Let the set of positive same-sign
eigenvectors be represented by $\Omega = \{\omega \mid \lambda_\omega > 0 \wedge [(x^*_\omega(i) > 0\ \forall i) \vee (x^*_\omega(i) < 0\ \forall i)]\}$.
Since the positive eigenvectors are orthogonal, this means that there is only one
value of $\omega$ for which $x^*_\omega(i) \neq 0$. In other words, each node $i$ is associated with a
unique cluster. We denote the set of nodes assigned to the cluster with modal
index $\omega$ as $V_\omega = \{i \mid x^*_\omega(i) \neq 0\}$.
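
The selection of same-sign positive eigenvectors can be sketched numerically as follows. This is an illustrative reading of the procedure, not the authors' implementation: `modal_clusters` is a hypothetical name, the link-weight matrix is assumed to be symmetric so that `numpy.linalg.eigh` applies, and the same-sign test is applied to the non-zero components only, with a small tolerance `tol` standing in for exact zero.

```python
import numpy as np


def modal_clusters(A, tol=1e-8):
    """Return node index lists for eigenvectors with a positive eigenvalue
    whose non-zero components all share one sign, leading eigenvalue first."""
    A = np.asarray(A, dtype=float)
    vals, vecs = np.linalg.eigh(A)  # eigenvalues in ascending order
    clusters = []
    for lam, x in zip(vals[::-1], vecs[:, ::-1].T):
        if lam <= tol:
            continue  # keep positive eigenvalues only
        support = np.abs(x) > tol
        nz = x[support]
        if nz.size and (np.all(nz > 0) or np.all(nz < 0)):
            clusters.append(np.flatnonzero(support).tolist())
    return clusters
```

On a block-diagonal matrix with two blocks of distinct strength, the two same-sign eigenvectors recover the two blocks, while the mixed-sign eigenvectors inside each block are rejected. With repeated eigenvalues the eigenvectors are not unique, so the output then depends on the basis the solver happens to return.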



3 Maximum Likelihood Framework 



In this paper, we are interested in exploiting the factorisation property of Sarkar
and Boyer [14] to develop a maximum likelihood method for updating the
link-weight matrix $A$ with the aim of developing a more robust perceptual grouping
method. We commence by factorising the likelihood of the observed arrangement
of objects over the set of modal clusters of the link-weight matrix. Since the set
of modal clusters are disjoint we can write,

$$P(A) = \prod_{\omega \in \Omega} P(\Phi_\omega) \qquad (2)$$

where $P(\Phi_\omega)$ is the probability distribution for the set of link-weights belonging
to the modal cluster indexed $\omega$. To model the component probability distributions,
we introduce a cluster membership indicator $s_{i\omega}$ which models the degree of
affinity of the object indexed $i$ to the cluster with modal index $\omega$. This is done
using the magnitudes of the modal coefficients and we set



$$s_{i\omega} = \frac{|x^*_\omega(i)|}{\sum_{i' \in V_\omega} |x^*_\omega(i')|} \qquad (3)$$



Using these variables, we develop a model of probability distribution for the
link-weights associated with the individual clusters. We commence by assuming
that there are putative edges between each pair of nodes $(i, j)$ belonging to the
cluster. The set of putative edges is $\Phi_\omega = V_\omega \times V_\omega - \{(i, i) \mid i \in V\}$. We further
assume that the link-weights belonging to each cluster are independent of one
another and write

$$P(\Phi_\omega) = \prod_{(i,j) \in \Phi_\omega} P(A_{i,j}) \qquad (4)$$
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To proceed, we require a model of probability distribution for the link-weights. 
Here we adopt a model in which the observed link strueture of the pairwise clus- 
ters arises through a Bernoulli distribution. The parameter of this distribution is 
the link-probability Ai,j. The idea behind this model is that any pair of nodes i 
and j may conneet to each with a link. This link is treated as a Bernoulli variable. 
The probability that this link is the correct is A; ^ while the probability that it 
is in error is 1 — A j j- . To gauge the correetness of the link, we check whether the 
nodes i and j belong to the same pairwise cluster. To test for cluster-consistency 
we make use of the quantity Si^Sj^. This is unity if both nodes belong to the 
same cluster and is zero otherwise. Using this switching property, the Bernoulli 
distribution becomes 



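This distribution takes on its largest values when either the link weight $A_{i,j}$ is
unity and $s_{i\omega} = s_{j\omega} = 1$, or the link weight $A_{i,j} = 0$ and $s_{i\omega} = s_{j\omega} = 0$.

With these ingredients the log-likelihood function for the observed pattern
of link weights is

$$\mathcal{L} = \sum_{\omega \in \Omega} \sum_{(i,j) \in \Phi_\omega} \left\{ s_{i\omega} s_{j\omega} \ln A_{i,j} + (1 - s_{i\omega} s_{j\omega}) \ln(1 - A_{i,j}) \right\} \qquad (6)$$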

After some algebra to collect terms, the log-likelihood function simplifies to 



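$$\mathcal{L} = \sum_{\omega \in \Omega} \sum_{(i,j) \in \Phi_\omega} \left\{ s_{i\omega} s_{j\omega} \ln \frac{A_{i,j}}{1 - A_{i,j}} + \ln(1 - A_{i,j}) \right\} \qquad (7)$$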
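As a quick numerical check on the algebra, the following sketch evaluates the Bernoulli log-likelihood in its expanded form and in the collected form of Equation (7) and confirms that they agree. It is only an illustration: the function names, the membership matrix layout (nodes by clusters) and the clipping constant `eps` are assumptions introduced here, not part of the original method.

```python
import numpy as np

def cluster_pair_mask(s):
    """Boolean mask of putative edges: pairs of distinct nodes with non-zero membership."""
    in_cluster = s > 0
    mask = in_cluster[:, None] & in_cluster[None, :]
    np.fill_diagonal(mask, False)          # exclude self-pairs (i, i)
    return mask

def log_likelihood(A, S, simplified=False, eps=1e-12):
    """Log-likelihood of link weights A given memberships S (nodes x clusters).

    simplified=False uses the expanded Bernoulli form,
    simplified=True uses the collected form with the log-odds term.
    """
    A = np.clip(np.asarray(A, dtype=float), eps, 1.0 - eps)   # guard against log(0)
    total = 0.0
    for s in np.asarray(S, dtype=float).T:                    # one column per cluster
        P = np.outer(s, s)                                    # s_i * s_j for every pair
        if simplified:
            terms = P * np.log(A / (1.0 - A)) + np.log(1.0 - A)
        else:
            terms = P * np.log(A) + (1.0 - P) * np.log(1.0 - A)
        total += terms[cluster_pair_mask(s)].sum()
    return total
```

Both calls should return the same value for any valid input, since the two expressions differ only by a rearrangement of terms.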



Posed in this way, the structure of the log-likelihood function is reminiscent
of that underpinning the expectation-maximisation algorithm. The modes of the
link-weight matrix play the role of mixing components. The product of
cluster-membership variables $s_{i\omega} s_{j\omega}$ plays the role of an a posteriori measurement
probability, and the link-weights are the parameters which must be estimated.
However, there are important differences. The most important of these is that
the modal clusters are disjoint. As a result there is no mixing between them.

Based on this observation, we will exploit an EM-like process to update 
the link-weights and the cluster-membership variables. In the “M” step we will 
locate maximum likelihood link-weights. In the “E” step we will use the revised 
link-weight matrix to update the modal clusters. To this end we index the link 
weights and cluster memberships with iteration number and aim to optimise the 
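quantity

$$Q(A^{(n+1)} | A^{(n)}) = \sum_{\omega \in \Omega} \sum_{(i,j) \in \Phi_\omega} \left\{ s_{i\omega}^{(n)} s_{j\omega}^{(n)} \ln \frac{A_{i,j}^{(n+1)}}{1 - A_{i,j}^{(n+1)}} + \ln(1 - A_{i,j}^{(n+1)}) \right\} \qquad (8)$$

The revised link-weight parameters are indexed at iteration $n+1$, while the
cluster-memberships are indexed at iteration $n$.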



3.1 Expectation 

To update the cluster-membership variables we have used a gradient-based
method. We have computed the derivatives of the expected log-likelihood
function with respect to the cluster-membership variable

Since the associated saddle-point equations are not tractable in closed form,
we use the soft-assign ansatz of Bridle [2] to update the cluster membership
assignment variables. This involves exponentiating the partial derivatives of the
expected log-likelihood function in the following manner

As a result the update equation for the cluster membership indicator variables
is

3.2 Maximisation

Once the revised cluster membership variables are to hand then we can apply 
the maximisation step of the algorithm to update the link-weight matrix. The 
updated link-weights are found by computing the derivatives of the expected 
log-likelihood function 



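$$\frac{\partial Q(A^{(n+1)} | A^{(n)})}{\partial A_{i,j}^{(n+1)}} = \sum_{\omega \in \Omega} \left\{ \frac{s_{i\omega}^{(n)} s_{j\omega}^{(n)}}{A_{i,j}^{(n+1)} (1 - A_{i,j}^{(n+1)})} - \frac{1}{1 - A_{i,j}^{(n+1)}} \right\} \qquad (12)$$

and solving the saddle-point equations

$$\frac{\partial Q(A^{(n+1)} | A^{(n)})}{\partial A_{i,j}^{(n+1)}} = 0 \qquad (13)$$

As a result the updated link-weights are given by

$$A_{i,j}^{(n+1)} = \frac{1}{|\Omega|} \sum_{\omega \in \Omega} s_{i\omega}^{(n)} s_{j\omega}^{(n)} \qquad (14)$$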



In other words, the link-weight for the pair of nodes $(i,j)$ is simply the average of
the product of individual node cluster memberships over the different perceptual
clusters. Since each node is associated with a unique cluster, this means that the
updated affinity matrix is composed of non-overlapping blocks. Moreover, the
link-weights are guaranteed to be in the interval [0,1].
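The maximisation step described above can be sketched in a few lines. This is a minimal illustration, assuming the memberships are stored as a nodes-by-clusters array `S` (a layout chosen here for convenience); zeroing the diagonal reflects the exclusion of self-pairs from the putative edges and is likewise an assumption of this sketch.

```python
import numpy as np

def m_step(S):
    """Updated link weights: average over clusters of the membership products s_i * s_j."""
    S = np.asarray(S, dtype=float)
    num_clusters = S.shape[1]
    A = S @ S.T / num_clusters     # sum over clusters of s_i * s_j, divided by |Omega|
    np.fill_diagonal(A, 0.0)       # assumption: self-links (i, i) are not modelled
    return A
```

When every node has non-zero membership in exactly one cluster, the product `S @ S.T` vanishes for nodes in different clusters, which is why the result is block structured.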



3.3 Modal Structure of the Updated Link-Weight Matrix 

Once the updated link-weight matrix is to hand, then we can use the modal 
analysis of Sarkar and Boyer to refine the set of clusters. The idea is a simple 
one. For each cluster, we compute an updated link-weight matrix. The leading 
eigenvector of this matrix provides a measure of the affinity of the different nodes 
to the cluster. 

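To proceed, we introduce some matrix notation. We commence by representing
the cluster-memberships of the cluster indexed $\omega$ using the column vector
$\mathbf{s}_\omega^{(n)} = (s_{1\omega}^{(n)}, s_{2\omega}^{(n)}, \ldots)^T$. We also define a $|V| \times |V|$ weight matrix $W_\omega^{(n)}$, whose
elements are

$$W_\omega^{(n)}(i,j) = \ln \frac{A_{i,j}^{(n)}}{1 - A_{i,j}^{(n)}} \; \zeta_{i,j,\omega} \qquad (15)$$

where

$$\zeta_{i,j,\omega} = \begin{cases} 1 & \text{if } s_{i\omega} \neq 0 \text{ and } s_{j\omega} \neq 0 \\ 0 & \text{otherwise} \end{cases} \qquad (16)$$

With this notation, the algorithm focuses on the quantity

$$Q(A^{(n+1)} | A^{(n)}) = \sum_{\omega \in \Omega} \mathbf{s}_\omega^T W_\omega \mathbf{s}_\omega \qquad (17)$$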



In this way the log-likelihood function is decomposed into contributions from the
distinct modal clusters. Moreover, each cluster weight matrix is disjoint. For
each such matrix, we will perform a further eigendecomposition to identify the
foreground and background modal structure. Recall that Sarkar and Boyer [14]
have shown that the scalar quantity $\mathbf{x}^T A \mathbf{x}$, where $A$ is the weighted adjacency
matrix and $\mathbf{x}$ is a vector of cluster-membership variables, is maximised when $\mathbf{x}$
is the leading eigenvector of $A$. Unfortunately, we cannot exploit this property
directly. The reasons for this are twofold. First, the utility measure underpinning
our maximum likelihood algorithm is a sum of terms of the form $\mathbf{s}_\omega^T W_\omega \mathbf{s}_\omega$.
Second, the elements of $W_\omega$ may be negative (since it is computed by taking
logarithms) and hence its eigenvalues will not be real. We overcome the first of
these problems by applying the Rayleigh-Ritz theorem to each weight matrix $W_\omega$
in turn. Each such matrix represents a distinct cluster and its leading eigenvector
represents the individual cluster-membership affinities of the nodes. To overcome
the second problem we make use of the fact that the directions of the eigenvectors
of the matrices $A$ and $\ln A$ are identical. We therefore commence from the matrix
whose elements are






$$C_{i,j} = \frac{A_{i,j}^{(n+1)}}{1 - A_{i,j}^{(n+1)}} \qquad (18)$$



We use the components of the leading eigenvector of $W_\omega$ to perform “modal
sharpening” on the cluster memberships. They are re-assigned according to the
following formula

$$s_{i\omega}^{(n+1)} = \frac{|x^*_\omega(i)|}{\sum_{i \in V_\omega} |x^*_\omega(i)|} \qquad (19)$$
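The sharpening step can be illustrated with a short sketch. It takes the leading eigenvector of the non-negative odds matrix of Equation (18) for a single cluster and normalises the magnitudes of its components so that they sum to one. Using the odds matrix here, the function name, and the clipping constant `eps` are assumptions of this sketch rather than details confirmed by the text.

```python
import numpy as np

def modal_sharpen(A_cluster, eps=1e-6):
    """Re-assign memberships from the leading eigenvector of the odds matrix A / (1 - A)."""
    A = np.clip(np.asarray(A_cluster, dtype=float), eps, 1.0 - eps)  # keep the odds finite
    C = A / (1.0 - A)                       # element-wise odds of each link
    eigvals, eigvecs = np.linalg.eigh(C)    # symmetric input; eigenvalues in ascending order
    leading = eigvecs[:, -1]                # eigenvector of the largest eigenvalue
    magnitudes = np.abs(leading)            # the overall sign of an eigenvector is arbitrary
    return magnitudes / magnitudes.sum()
```

Nodes that are strongly linked to the rest of the cluster receive larger memberships, while weakly linked nodes are suppressed.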



Before proceeding, it is worth pausing to consider the relationship between
this modal analysis and the updated cluster membership variables. Using the
update formula obtained for the cluster-membership variables given in Equation
(11), it is a straightforward matter to show that the log-likelihood function is
given by

where

$$\gamma_{i,j}^{(n+1)} = \ln \frac{A_{i,j}^{(n+1)}}{1 - A_{i,j}^{(n+1)}} \qquad (21)$$

To cast the log-likelihood function into a matrix setting, suppose that $\Gamma^{(n+1)}$
is the matrix whose entry with row $i$ and column $j$ is $\gamma_{i,j}^{(n+1)}$. Further, suppose
that $\mathbf{x}_\omega = (x_\omega(1), x_\omega(2), \ldots, x_\omega(m))^T$ is the eigenvector associated with the
eigenvalue $\lambda_\omega^{(n+1)}$ of $\Gamma^{(n+1)}$. For this eigenvector, the eigenvalue equation is

$$\Gamma^{(n+1)} \mathbf{x}_\omega = \lambda_\omega^{(n+1)} \mathbf{x}_\omega \qquad (22)$$

Furthermore, the $j$th component of the eigenvector satisfies the equation

$$\sum_{i=1}^{m} \gamma_{j,i}^{(n+1)} x_\omega(i) = \lambda_\omega^{(n+1)} x_\omega(j) \qquad (23)$$

If the vector of cluster membership variables $\mathbf{s}_\omega^{(n)} = (s_{1\omega}^{(n)}, s_{2\omega}^{(n)}, \ldots, s_{m\omega}^{(n)})^T$
is an eigenvector of $\Gamma^{(n+1)}$, then we can write

Collecting terms together

(24)

For the cluster indexed $\omega$, the contribution to the log-likelihood function is clearly
maximised when $\lambda_\omega^{(n+1)}$ is the largest eigenvalue of $\Gamma^{(n+1)}$.



3.4 Algorithm Description 

Fig. 1. Results obtained on synthetic images using the eigendecomposition method and
the soft-assign.

To summarise, the iterative steps of the algorithm are as follows:

— (1) Compute the eigenvectors of the current link-weight matrix $A^{(n)}$. Each
same-sign eigenvector whose eigenvalue is positive represents a disjoint pairwise
cluster. The number of such eigenvectors determines the number of
clusters for the current iteration. This number may vary from iteration to
iteration.

— (2) Compute the updated cluster-membership variables using the E-step.
At this stage modal sharpening may be performed to improve the cluster
structure if desired. This sharpening process may be iterated to refine the
current set of clusters.

— (3) Update the link-weights using the M-step to compute the updated
link-weight matrix $A^{(n+1)}$.

— Go to step (1).
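The iterative steps listed above can be sketched as a short loop. This is a simplified illustration under stated assumptions, not the authors' implementation: memberships are taken directly from the normalised eigenvector magnitudes (the soft-assign update is not reproduced), components smaller than the hypothetical tolerance `tol` are treated as zero, and the function names are introduced here for illustration only.

```python
import numpy as np

def modal_clusters(A, tol=1e-8):
    """Step (1)/(2): memberships from same-sign eigenvectors with positive eigenvalues."""
    eigvals, eigvecs = np.linalg.eigh(A)                 # ascending eigenvalues
    columns = []
    for lam, x in zip(eigvals[::-1], eigvecs[:, ::-1].T):
        x = np.where(np.abs(x) < tol, 0.0, x)            # suppress numerical noise
        same_sign = np.all(x >= 0) or np.all(x <= 0)
        if lam > tol and same_sign and np.any(x != 0):
            columns.append(np.abs(x) / np.abs(x).sum())  # normalised magnitudes
    if not columns:
        return np.zeros((A.shape[0], 0))
    return np.column_stack(columns)

def em_like_grouping(A, iterations=3):
    """Alternate modal analysis and link-weight update for a fixed number of iterations."""
    A = np.asarray(A, dtype=float)
    S = modal_clusters(A)
    for _ in range(iterations):
        S = modal_clusters(A)                            # steps (1) and (2)
        if S.shape[1] == 0:                              # no same-sign mode found
            break
        A = S @ S.T / S.shape[1]                         # step (3): average of s_i * s_j
    return A, S
```

One caveat worth keeping in mind: when two clusters share exactly the same eigenvalue, the eigenvectors returned by a numerical solver are not unique and may mix the clusters, so a practical implementation would need to handle such ties explicitly.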
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4 Grey Scale Image Segmentation 

The first application of our new pairwise clustering method involves segmenting
grey-scale images into regions. To compute the initial affinity matrix, we use
the difference in grey-scale values at different pixel sites. Suppose that $g_i$ is the
grey-scale value at the pixel indexed $i$ and $g_j$ is the grey-scale value at the pixel
indexed $j$. The corresponding entry in the affinity matrix is

$$A_{i,j}^{(0)} = \exp[-k_g (g_i - g_j)^2] \qquad (25)$$

If we are segmenting an $R \times C$ image of $R$ rows and $C$ columns, then the affinity
matrix is of dimensions $RC \times RC$. This initial characterisation of the affinity
matrix is similar to that used by Shi and Malik [16] in their normalised-cut
method of image segmentation.
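A minimal sketch of this construction follows. It assumes the affinity is the exponential of the negative scaled squared grey-level difference, with `k_g` as the scale constant; the function name and the row-major pixel ordering are choices made here for illustration.

```python
import numpy as np

def grey_scale_affinity(image, k_g=1.0):
    """Initial affinity matrix for an R x C image; the result has shape (R*C, R*C)."""
    g = np.asarray(image, dtype=float).ravel()   # pixel i -> grey value g_i (row-major order)
    diff = g[:, None] - g[None, :]               # all pairwise differences g_i - g_j
    return np.exp(-k_g * diff ** 2)
```

Because the matrix grows as the square of the number of pixels, a sketch like this is only practical for small images or image blocks.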




Fig. 2. Number of segmentation errors versus noise standard deviation for
non-iterative eigendecomposition (top curve), soft-assign EM (middle curve) and modal
sharpening EM (lower curve).



By applying the clustering method to this initial affinity matrix we iteratively 
segment the image into regions. Each eigenmode corresponds to a distinct region. 



4.1 Experiments 

We have conducted experiments on both synthetic and real images. We commence
with some examples on synthetic images aimed at establishing some of
the properties of the resulting image segmentation method. In Figure 1 a) and
b) we investigate the effect of contrast variations. The left-hand panel shows an
image which is divided into two rectangular regions. Within each region there is
variation in intensity whose distribution is generated using a Lambertian sphere.
In the right-hand panel we show the resulting segmentation when the EM-like
method is used with modal sharpening. There are two modes, i.e. detected
regions. These correspond to the two rectangular regions in the original image.
There is no fragmentation due to the spherical intensity variation.

Next we consider the effect of added random noise. In Figure 1 c), d), e) and
f) (i.e. the second row) we show a sequence of images in which we have added
Gaussian noise of zero mean and known standard deviation to the grey-scale
values in an image containing three rectangular regions. The third row (i.e. Figure 1
g, h, i and j) shows the segmentation result obtained with the EM algorithm
when cluster-memberships are updated using the modal refinement process
described in Section 3.3. The fourth row (i.e. Figure 1 k, l, m and n) shows the
segmentation obtained using the EM algorithm with cluster membership update using
soft-assign. For the different images, the standard deviation of the added Gaussian noise
is 35%, 50%, 65% and 95% of the grey-scale difference between the regions. The
final segmentations are obtained with an average of 2.3 iterations per cluster.
The final segmentations contain 3, 3, 4 and 4 clusters respectively. The method
begins to fail once the noise exceeds 60%. It is also worth noting that the region
boundaries and corners are well reconstructed.

Fig. 3. Comparison between the non-iterative eigendecomposition approach and the
two variants of the EM-like algorithm.

Fig. 4. Segmentations of grey-scale images.

Figure 2 offers a more quantitative evaluation of the segmentation capabil- 
ities of the method. Here we compute the fraction of mislabelled pixels in the 
segmented images as a function of the standard deviation of the added Gaus- 
sian noise. The plot shows three performance curves obtained with a) the non- 
iterative eigendecomposition algorithm, b) the EM-like method with soft-assign 
and c) the EM-like algorithm with modal sharpening. The non-iterative method 
fails abruptly at low noise-levels. The two variants of the EM-like algorithm per- 
form much better, with the modal sharpening method offering a useful margin 
of advantage over the soft-assign method. 

We have repeated the experiments described above for a sequence of syn- 
thetic images in which the density of distractors increases. For each image in 
turn we have computed the number of distractors merged with the foreground 
pattern and the number of foreground line-segments which leak into the back- 
ground. Figures 3 a and b respectively show the fraction of nodes merged with 
the foreground and the number of nodes which leak into the background as a 
function of the number of distractors. The three curves shown in each plot are 
for the non-iterative eigendecomposition method and for the EM-like algorithm 
when both soft-assign and modal sharpening are used to update the cluster 
membership weights. In both cases, the shoulder of the response curve for the 
two variants of the EM-like algorithm occurs at a significantly higher error rate 
than that for the non-iterative eigen-decomposition method. Of the two alterna- 
tive methods for updating the cluster membership weights in the EM algorithm, 
modal sharpening works best. 

To conclude this section, in Figure 4 we provide some example segmentations 
on real-world images. In the top row we show the original image, the middle row 
is the segmentation obtained with EM and modal sharpening, while the bottom 
row shows the segmentation obtained with EM and soft-assign. On the whole the 
results are quite promising. The segmentations capture the main region structure 
of the images. Moreover, they are not unduly disturbed by brightness variations 
or texture. The modal sharpening method gives the cleanest segmentations. It 
should be stressed that these results are presented to illustrate the scope offered 
by our new clustering algorithm and not to make any claims concerning its 
utility as a tool for image segmentation. To do so would require comparison and 
sensitivity analysis well beyond the scope of this paper. 

One of the interesting properties of our method is that the number of modes
or clusters changes with each iteration of the algorithm. This is because we
perform a new modal analysis each time the link-weight matrix is updated.
For the segmentation results shown in Figure 4, we have investigated how the number
of modal clusters varies with iteration number. In Figure 5 we show the number
of active clusters as a function of iteration number for each of the real-world
images. In each case the number of clusters increases with iteration number. In
the best case the number of clusters stabilizes after 2 iterations (Figure 4i), and
in the worst case after 6 iterations (Figure 4l).

5 Motion Segmentation 

Our second application involves motion segmentation. To compute motion
vectors we have used a single-resolution block matching algorithm using
spatial/temporal correlation [3]. Block matching algorithms of this kind are based on
a predictive search that reduces the computational complexity and provides
reliable performance. The 2D velocity vectors for the extracted motion blocks
are characterised using a matrix of pairwise similarity weights. Suppose that $\hat{n}_i$
and $\hat{n}_j$ are the unit motion vectors for the blocks indexed $i$ and $j$. The elements
of this weight matrix are given by

$$A_{i,j}^{(0)} = \begin{cases} \frac{1}{2} (1 + \hat{n}_i \cdot \hat{n}_j) & \text{if } i \neq j \\ 0 & \text{otherwise} \end{cases} \qquad (26)$$

5.1 Motion Experiments

We have conducted experiments on motion sequences with known ground truth.
In Figures 6a and 6b we show ground truth data for the “Hamburg taxi” and
“Trevor White” sequences. The images show the distinct motion components for
the two scenes. The corresponding raw images are shown in Figures 6c and 6f.
In both sequences we use 8x8 pixel blocks to compute the motion vectors. For
the “Hamburg taxi” sequence the motion field is shown in Figure 6d. In Figure
6e we show the pairwise clusters obtained by applying our EM-like algorithm.
There are 3 clusters which match closely to the ground truth data shown in
Figure 6a. In fact, the three different clusters correspond to distinct moving
vehicles in the sequence. Figure 6g shows the corresponding motion field for
the “Trevor White” sequence. The motion segmentation is shown in Figure 6h.
There are three clusters which correspond to the head, the right arm, and the
chest plus left arm. These clusters again match closely to the ground-truth data.
It is interesting to note that the results are comparable to those reported in
[1], where a 5-dimensional feature vector and a neural network were used. The
proposed algorithm converges in an average of four iterations.

Fig. 5. Number of clusters per iteration for each of the real-world images in Fig. 4.

Fig. 6. Results for motion sequences.
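For reference, the pairwise motion weights used in these experiments can be sketched as follows. The sketch assumes the weight between two distinct blocks is half of one plus the dot product of their unit motion vectors, with zeros on the diagonal; the function name and the handling of zero-length vectors are assumptions introduced here.

```python
import numpy as np

def motion_affinity(vectors):
    """Pairwise similarity of block motion vectors; input has shape (num_blocks, 2)."""
    v = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(v, axis=1, keepdims=True)
    unit = v / np.where(norms == 0, 1.0, norms)   # zero vectors stay zero (assumption)
    A = 0.5 * (1.0 + unit @ unit.T)               # 1 for parallel, 0 for opposite directions
    np.fill_diagonal(A, 0.0)                      # no self-links
    return A
```

Blocks moving in the same direction receive a weight of one regardless of speed, while blocks moving in opposite directions receive a weight of zero.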

In Table 1 we provide a more quantitative analysis of these results. The table
lists the fraction of the pixels in each region of the ground truth data which are
misassigned by the clustering algorithm. The best results are obtained for the
chest region, the taxi and the far-left car, where the error rate is a few percent.
For the far-right car and the head in the Trevor White sequence, the error rates are about
10%. The problems with the far-right car probably relate to the fact that it is
close to the periphery of the image.

Table 1. Error percentage for the two image sequences.

Sequence        Cluster          % of Error
Trevor White    Right arm         8 %
Trevor White    Chest             6 %
Trevor White    Head             12 %
Ham. Taxi       Taxi              4 %
Ham. Taxi       Far Left Car      3 %
Ham. Taxi       Far Right Car    10 %



6 Conclusions 

In this paper, we have presented a new perceptual clustering algorithm which 
uses an EM-like algorithm to estimate link-weights and cluster membership prob- 
abilities. The method is based on an iterative modal decomposition of the link- 
weight matrix. The modal cluster membership probabilities are modeled using a 
Bernoulli distribution for the link-weights. We apply the method to the problems
of region segmentation and of motion segmentation. The method appears robust
to severe levels of background clutter.

Abstract. We consider energy minimization problems related to image
labeling, partitioning, and grouping, which typically show up at mid-level 
stages of computer vision systems. A common feature of these problems 
is their intrinsic combinatorial complexity from an optimization point- 
of-view. Rather than trying to compute the global minimum - a goal we 
consider as elusive in these cases - we wish to design optimization ap- 
proaches which exhibit two relevant properties: First, in each application 
a solution with guaranteed degree of suboptimality can be computed. Sec- 
ondly, the computations are based on clearly defined algorithms which 
do not comprise any (hidden) tuning parameters. 

In this paper, we focus on the second property and introduce a novel
and general optimization technique to the field of computer vision which
amounts to computing a suboptimal solution by just solving a convex
optimization problem. As representative examples, we consider two
binary quadratic energy functionals related to image labeling and
perceptual grouping. Both problems can be considered as instances of a
general quadratic functional in binary variables, which is embedded into
a higher-dimensional space such that suboptimal solutions can be
computed as minima of linear functionals over cones in that space
(semidefinite programs). Extensive numerical results reveal that, on average,
suboptimal solutions can be computed which yield a gap below 5% with
respect to the global optimum in cases where this is known.



1 Introduction 

Many energy-minimization problems in computer vision, like image labeling and
partitioning, perceptual grouping, graph matching etc., involve discrete
decision variables and are therefore intrinsically combinatorial by nature.
Accordingly, optimization approaches to efficiently compute good minimizers have a
long history in the literature. Important examples include the seminal paper
by Geman and Geman [1] on simulated annealing, approaches for suboptimal
Markov Random Field (MRF) minimization like the ICM-algorithm [2], the
highest-confidence-first heuristic [3], multi-scale approaches [4], and other
approximations [5,6]. A further important class of approaches comprises
continuation methods like Leclerc's partitioning approach [7], the graduated-non-convexity
strategy by Blake and Zisserman [8], and various deterministic (approximate)
versions of the annealing approach in applications like surface reconstruction [9],
perceptual grouping [10], graph matching [11], or clustering [12].

Apart from the simulated annealing approach using annealing schedules which 
are impractically slow for real-world applications (but prescribed by theory, see 
[1]), none of the above-mentioned approaches can guarantee to find the global 
minimum. And in general, this goal is elusive due to the combinatorial complex- 
ity of these minimization problems. Consequently, the important question arises: 
How good is a minimizer computed relative to the unknown global optimum? 
Can a certain quality of solutions in terms of its suboptimality be guaranteed in 
each application? To the best of our knowledge, none of the approaches above 
(apart from simulated annealing) seems to be immune against getting trapped 
in some local minimum and hence does not meet these criteria. 

A further problem relates to the algorithmic properties of these approaches. 
Apart from simple greedy strategies [2,3], most approaches involve some (some- 
times hidden) parameters on which the computed local minimum critically de- 
pends. A typical example is given by the artificial temperature parameter in 
deterministic annealing approaches and the corresponding iterative annealing 
schedule. It is well known [13] that such approaches exhibit complex bifurcation 
phenomena, the transitions of which (that is, which branch to follow) cannot 
be controlled by the user. Furthermore, these approaches involve highly nonlin- 
ear numerical fixed-point iterations which tend to oscillate in a parallel (syn- 
chronous) update mode (see [10, p. 906] and [15]). 

These problems can be avoided by going back to the mathematically well- 
understood class of convex optimization problems. Under mild assumptions there 
exists a global optimum which, in turn, leads to a suboptimal solution of the 
original problem, along with clear algorithms to compute it. Abstracting from the 
computational process, we can simply think of a mapping taking the data to this 
solution. Thus, evidently, no hidden parameter is involved. Concerning global 
energy-minimization problems in computer vision, this has been exploited for 
continuous-valued functions in [16,17], for example, to approximate the classical 
Mumford-Shah functional [18] for image segmentation. 

In this paper, however, we focus on more difficult problems by extending this 
line of research to prototypical energy-minimization problems involving discrete 
decision variables. Our work is based on the seminal paper by Lovász and
Schrijver [19], who showed how tight problem relaxations can be obtained by lifting
the problem up into some higher-dimensional space and down-projecting to a
convex set containing feasible solutions in that space. This idea has been put
forward and led to a remarkable result by Goemans and Williamson [20], who
were able to show for a classical combinatorial problem that suboptimal solu- 
tions (for the special problem considered) cannot be worse than 14% relative to 
the unknown global optimum. These two facts - bounds on the suboptimality, 
and algorithm design based on convex optimization - have motivated our work. 

Organisation of the paper. We consider in Section 2 two representatives
of the class of quadratic functionals in binary variables. This class of
minimization problems is well-known in the context of image labeling, perceptual
grouping, MRF-modeling, etc. We derive a problem relaxation leading to a
convex optimization problem in Section 3. The corresponding convex programming
techniques are sketched in Section 4. In Section 5, we illustrate the properties
of our approach by describing ground-truth experiments conducted with
one-dimensional signals, for which the global optimum can be easily computed with
dynamic programming. Real-world examples are discussed in Section 6, and we
conclude our paper by indicating further work in Section 7.

Notation. For a vector $y \in \mathbb{R}^n$, $D(y)$ denotes the diagonal matrix with entries
$y_1, \ldots, y_n$. $e$ denotes the vector of ones, $e_i = 1, \forall i$, and $I = D(e)$ the unit matrix.
For a matrix $X$, $D(X)$ denotes the diagonal matrix with the diagonal elements
$X_{ii}, \forall i$, of $X$. $S^n$ denotes the space of symmetric $n \times n$ matrices $X^T = X$, and $S^n_+$
denotes the matrices $X \in S^n$ which are positive semidefinite. For abbreviation,
we will also use the symbol $\mathcal{K} = S^{n+1}_+$. For two matrices $X, Y \in S^n$, $X \bullet Y =
\mathrm{trace}(XY)$ denotes the standard matrix inner product.

2 Problem Statement: 

Minimizing Binary Quadratic Functionals 

In this paper, we consider the problem of minimizing functionals of the general
form:

$$J(x) = x^T Q x + 2 b^T x + \mathrm{const} , \qquad x \in \{-1,1\}^n, \ Q \in S^n, \ b \in \mathbb{R}^n . \qquad (1)$$

In the field of computer vision, such global optimization problems arise in various 
contexts. In the following sections, we give two examples related to image labeling 
and perceptual grouping, respectively. 

Note that apart from symmetry, no further constraints are imposed on the
matrix $Q$ in (1). Hence, the functional $J$ need not be convex in general. This
property, along with the integer constraint $x_i \in \{-1,1\}$, $i = 1, \ldots, n$, makes the
minimization problem (1) intrinsically difficult.

In Section 3 we will relax some of these hard constraints so as to arrive at a 
convex optimization problem which closely approximates the original one. 
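To make the combinatorial nature of the problem concrete, the following sketch finds the exact minimizer of the quadratic part of such a functional by exhaustive enumeration. It is purely illustrative: the function name is introduced here, the additive constant is ignored, and the loop visits all 2^n sign vectors, so it is only usable for very small n.

```python
import itertools
import numpy as np

def brute_force_minimum(Q, b):
    """Exact minimizer of x^T Q x + 2 b^T x over x in {-1, 1}^n by enumeration."""
    Q = np.asarray(Q, dtype=float)
    b = np.asarray(b, dtype=float)
    best_x, best_value = None, np.inf
    for signs in itertools.product((-1.0, 1.0), repeat=len(b)):   # 2^n candidates
        x = np.array(signs)
        value = x @ Q @ x + 2.0 * b @ x
        if value < best_value:
            best_x, best_value = x, value
    return best_x, best_value
```

The exponential growth of this search is what motivates replacing the exact problem by a convex relaxation.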

2.1 Example 1: Binary Image Restoration and Labeling 

Consider some scalar-valued feature (grey-value, color feature, texture measure,
etc.) $g : \Omega \to \mathbb{R}$ which has been locally computed within the image plane.
Suppose that for each pixel position $i$, feature $g$ is known to originate from
either of two prototypical values $u_1, u_2$. In practice, of course, $g$ is real-valued
due to measurement errors and noise. Figure 1 shows an example.

To restore a discrete-valued image function $x$ from the measurements $g$, we
wish to compute $x$ as minimizer of a functional which has the form (1):

$$J(x) = \frac{1}{4} \sum_i \big( (u_2 - u_1) x_i + u_1 + u_2 - 2 g_i \big)^2 + \frac{\lambda}{2} \sum_{\langle i,j \rangle} (x_i - x_j)^2 , \qquad x_i \in \{-1,1\}, \ \forall i . \qquad (2)$$

Fig. 1. Top: A binary image, heavily corrupted by (real-valued) noise. Bottom, left:
The original data textured on a plane. Bottom, right: The data as 3D-plot to illustrate
the poor signal-to-noise ratio.



Here, the second term sums over all pairwise adjacent variables on the regular
image grid.

Functional (2) comprises two terms familiar from many regularization
approaches [21]: a data-fitting term and a smoothness term modeling spatial
context. However, due to the integer constraint $x_i \in \{-1,1\}$, the optimization
problem considered here is much more difficult than standard regularization
problems.

We note further that, depending on the application considered, it might be 
useful to modify the terms in (2), either to model properties of the imaging device 
(data-fitting term) or to take into consideration a priori known spatial regular- 
ities (smoothness term; see, e.g., [22]). These modifications, however, would not 
increase the difficulty of problem (2) from an optimization point-of-view. 
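To see how a restoration functional of this kind fits the general quadratic form, the sketch below works through the expansion for a one-dimensional signal. It assumes a data term of one quarter times the squared residual and a smoothness term of lambda over two times the squared difference of neighbouring labels; that smoothness form, the function names, and the 1D neighbourhood are assumptions made for illustration. Since each label squares to one, the squared terms contribute only to the constant.

```python
import numpy as np

def restoration_energy(x, g, u1, u2, lam):
    """Direct evaluation: data-fitting term plus smoothness term on a 1D chain."""
    data = 0.25 * np.sum(((u2 - u1) * x + u1 + u2 - 2.0 * g) ** 2)
    smooth = 0.5 * lam * np.sum((x[1:] - x[:-1]) ** 2)
    return data + smooth

def quadratic_form(g, u1, u2, lam):
    """Matrix Q, vector b and constant such that energy = x^T Q x + 2 b^T x + const."""
    n = len(g)
    c = u1 + u2 - 2.0 * g
    Q = np.zeros((n, n))
    for i in range(n - 1):                       # neighbouring pairs (i, i+1)
        Q[i, i + 1] = Q[i + 1, i] = -0.5 * lam   # from -lam * x_i * x_{i+1}
    b = 0.25 * (u2 - u1) * c                     # from the cross term of the data residual
    const = 0.25 * n * (u2 - u1) ** 2 + 0.25 * np.sum(c ** 2) + lam * (n - 1)
    return Q, b, const
```

For any label vector with entries in {-1, 1}, both evaluations should agree, which confirms that only the off-diagonal entries of the matrix and the linear term depend on the data and the neighbourhood structure.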



2.2 Example 2: Figure-Ground Discrimination 
and Perceptual Grouping 

Let $g_i$, $i = 1, \ldots, n$, denote some feature primitive irregularly distributed over
the image plane. Suppose that for each pair of primitives $g_i, g_j$, we can
compute some (dis)similarity measure $d_{ij}$ corresponding to some of the well-known
“Gestalt laws”, or to some specific object properties learned from examples. For
instance, $g_i$ might denote an edge-element computed at location $i$ in the image
plane, and $d_{ij}$ might denote some measure corresponding to smooth
continuation, co-circularity, etc. For an overview over various features and strategies for
perceptual grouping we refer to [23].

According to the spatial context modeled by $d_{ij}$, we wish to separate familiar
configurations from the (unknown) background. To this end, following [10], we
label each primitive $g_i$ with a decision variable $x_i \in \{-1,1\}$ (“1” corresponding
to figure, “-1” corresponding to background and noise) and wish to minimize a
functional of the form (1):

$$J(x) = \sum_{\langle i,j \rangle} (\lambda - d_{ij}) x_i x_j + 2 \sum_i \Big( \lambda n - \sum_j d_{ij} \Big) x_i , \qquad x_i \in \{-1,1\}, \ \forall i . \qquad (3)$$

Fig. 2. Left: An “object”. Middle: The similarity measure for $\Delta\phi_{ij} \in [0,\pi)$, $\beta = 20$,
according to the expected relative angles, and allowing for some inaccuracies of a
(fictive) preprocessing stage. Right: The object was rotated by a fixed arbitrary angle,
and translated and scaled copies have been superimposed by noise. Where are these
objects?

Figure 2 shows a test-problem we use in this paper for illustration. On the left,
some “object” is shown which distinguishes itself from background and clutter
by the relative angles of edgels. Such edgels typically arise as output of some
local edge detector. Accordingly, the difference between the relative angles $\Delta\phi_{ij}$
and the expected ones (due to our knowledge about the object) was chosen as
similarity measure $d_{ij}$ with respect to primitives $i$ and $j$. In addition, we take
into consideration inaccuracies of a (fictive) preprocessing stage by virtue of a
parameter $\beta$ (see Figure 2):

d’ij GXp( expected ) ) ■ 

k 

Clearly, this measure is invariant against translation, rotation and scaling of 
the object. On the right in Figure 2, an unknown number of translated and scaled 
copies of the object, which has been rotated in advance by an unknown angle, 
is shown together with a lot of noisy primitives as “background” . Trying to find 
these objects leads to combinatorial search. By contrast, we are interested in 
suboptimal minimizers of the functional (3) computed by convex programming. 

3 Convex Problem Relaxation 

Recall that both problems (2) and (3) (and many others) are special cases of
problem (1).

In order to relax problem (1), we first drop the constant and homogenize the
objective function as follows:

$$x^T Q x + 2 b^T x = \begin{pmatrix} x \\ 1 \end{pmatrix}^T L \begin{pmatrix} x \\ 1 \end{pmatrix} , \qquad L = \begin{pmatrix} Q & b \\ b^T & 0 \end{pmatrix} . \qquad (4)$$

With slight abuse of notation, we denote the vector $(x^T\ 1)^T$ again by $x$.
Next we introduce the Lagrangian with respect to problem (1):

$$x^T L x - \sum_i y_i (x_i^2 - 1) = x^T (L - D(y)) x + e^T y$$

and the corresponding minimax-problem:

$$\sup_y \inf_x \; x^T (L - D(y)) x + e^T y .$$

Since $x$ is unconstrained now, the inner minimization is finite-valued only if
$L - D(y) \in S^{n+1}_+ = \mathcal{K}$ (for notation, see Section 1). Hence we arrive at the
relaxed problem:

$$\sup_y \; e^T y , \qquad L - D(y) \in \mathcal{K} . \qquad (5)$$

The important point here is that problem (5) is a convex optimization problem! 
The set $\mathcal{K}$ is a cone (i.e. a special convex set) and self-dual, that is, it 
coincides with the dual cone [24] 

$$ \mathcal{K}^* = \{ Y : X \bullet Y \ge 0,\ \forall X \in \mathcal{K} \} . $$



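To obtain the connection to our original problem, we derive the dual problem 
associated with (5). Choosing a Lagrangian multiplier $X \in \mathcal{K}^* = \mathcal{K}$, similar 
reasoning as above yields: 

$$ \begin{aligned}
\sup_{y} e^\top y &= \sup_{y} \inf_{X \in \mathcal{K}} \; e^\top y + X \bullet (L - D(y)) \\
&\le \inf_{X \in \mathcal{K}} \sup_{y} \; e^\top y + X \bullet (L - D(y)) \\
&= \inf_{X \in \mathcal{K}} \sup_{y} \; L \bullet X - D(y) \bullet (X - I) .
\end{aligned} $$

The inner maximization of the last equation is finite iff $D(X) = I$. Hence, we 
obtain as the problem dual to (5): 

$$ \inf_{X \in \mathcal{K}} \; L \bullet X , \qquad D(X) = I , \tag{6} $$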

which again is convex. 

In order to compare the relaxation (6) with the problems (1) and (4), 
respectively, we rewrite the latter as follows: 

$$ \inf_x \; x^\top L x = \inf_x \; L \bullet x x^\top . $$

Note that the matrix $xx^\top \in \mathcal{K}$ and has rank one. A comparison with the relaxed 
problem (6) shows that (i) $xx^\top$ is replaced by an arbitrary matrix $X \in \mathcal{K}$ (i.e. the 
rank one condition is dropped), and (ii) that the integer constraint $x_i \in \{-1, 1\}$ 
is weakly imposed by the constraint $D(X) = I$ in (6). 
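
The relation between the combinatorial problem and its relaxation can be checked numerically on a tiny instance. The following is only a sketch under stated assumptions: the matrix `L` is random, hypothetical data, the optimum is found by brute force, and the dual-feasible point $y = \lambda_{\min}(L)\, e$ is a simple illustrative choice, not the optimal solution of (5).

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 6                                  # small enough for brute force
M = rng.standard_normal((n, n))
L = (M + M.T) / 2                      # hypothetical symmetric problem matrix

# Brute-force optimum of inf x^T L x over x in {-1, 1}^n.
best_val, best_x = min(
    (float(np.array(x) @ L @ np.array(x)), x)
    for x in itertools.product((-1.0, 1.0), repeat=n)
)

# X = x x^T is feasible for (6): positive semidefinite with unit diagonal,
# and the trace inner product L . X equals x^T L x.
x = np.array(best_x)
X = np.outer(x, x)
inner = float(np.sum(L * X))

# y = lambda_min(L) * e is feasible for (5), since L - lambda_min * I is
# positive semidefinite; e^T y is therefore a lower bound on the optimum.
lam_min = float(np.linalg.eigvalsh(L)[0])
dual_bound = n * lam_min
```

Solving (5) or (6) with a semidefinite programming solver would give a bound at least as tight as `dual_bound`; the choice above merely illustrates weak duality without requiring such a solver.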

In the following sections, we will examine the relaxed problem (6) with re- 
spect to the criteria discussed in Section 1. 

4 Algorithm 

The primal-dual pair of optimization problems (6) and (5), respectively, belongs 
to the class of conic programs. The elegant duality theory corresponding to this 
class of convex optimization problems can be found in [24]. For “well-behaved” 
instances of this problem class, optimal primal and dual solutions $X^*, y^*, S^*$ 
exist ($S$ denotes a matrix of slack variables) and are complementary to each 
other: $X^* \bullet S^* = 0$. Moreover, no duality gap exists between the optimal values 
of the corresponding objective functions: 

$$ L \bullet X^* - e^\top y^* = S^* \bullet X^* = 0 . $$

To compute $X^*, y^*$ and $S^*$, a wide range of iterative interior-point algorithms 
can be used. Typically, a sequence of minimizers $X_\eta, y_\eta, S_\eta$, parametrized by a 
parameter $\eta$, is computed until the duality gap falls below some threshold $\epsilon$. 
A remarkable result in [24] asserts that for the family of self-concordant barrier 
functions, this can always be done in polynomial time, depending on the number 
of variables $n$ and the value of $\epsilon$. 

For our experiments described in the following two sections, we chose the so- 
called dual-scaling algorithm using public software from a corresponding website 
[25]. To get back the solution x to (1) from the solution X to (6), we used the 
randomized-hyperplane technique described in [20]. 
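
A minimal sketch of a randomized-hyperplane rounding step, in the style commonly used for semidefinite relaxations, may help to make the last step concrete. Whether it matches the procedure of [20] in every detail is an assumption; the function name `round_hyperplane`, the number of trials and the test data are hypothetical.

```python
import numpy as np

def round_hyperplane(X, L, trials=100, seed=0):
    """Round a PSD matrix X with unit diagonal to x in {-1, 1}^n (sketch)."""
    rng = np.random.default_rng(seed)
    # Factor X = V V^T; row i of V is the vector associated with variable i.
    w, U = np.linalg.eigh(X)
    V = U * np.sqrt(np.clip(w, 0.0, None))
    best_x, best_val = None, np.inf
    for _ in range(trials):
        r = rng.standard_normal(V.shape[1])   # normal of a random hyperplane
        x = np.where(V @ r >= 0, 1.0, -1.0)
        x *= x[-1]                            # homogenizing entry must be +1
        val = float(x @ L @ x)
        if val < best_val:
            best_x, best_val = x, val
    return best_x, best_val

# Usage on hypothetical data: a rank-one X = x0 x0^T is rounded back to x0.
rng = np.random.default_rng(1)
M = rng.standard_normal((5, 5))
L = (M + M.T) / 2
x0 = np.array([1.0, -1.0, -1.0, 1.0, 1.0])
x_hat, val = round_hyperplane(np.outer(x0, x0), L)
```

Keeping the best of several random hyperplanes is a common practical choice; the sign normalization reflects the homogenization, where the last component of the vector stands for the constant 1.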

A more detailed description of the algorithm, along with useful modifications 
according to the problem class considered, is beyond the scope of this paper and 
will be reported elsewhere. 

5 Performance: Ground-Truth Numerical Experiments 

In this section, we investigate the performance of the relaxed problem (6) ex- 
perimentally. To this end, we report the statistical results for three different 
ground-truth experiments using one-dimensional random signals. 

We chose one-dimensional signals in this section because ground truth (the 
global optimum) can be easily computed using dynamic programming. Numer- 
ical results concerning two-dimensional signals (images) and grouping experi- 
ments are reported in section 6. 
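
For illustration, the ground-truth computation by dynamic programming can be sketched as follows. The functional (2) is not restated in this section, so the energy used here, a quadratic data term plus $\lambda$ times a quadratic smoothness term over neighbouring pixels, is an assumed stand-in; the names `energy` and `minimize_chain` are hypothetical.

```python
import itertools
import random

def energy(x, g, lam):
    """Assumed chain energy: data term plus lam times smoothness term."""
    data = sum((xi - gi) ** 2 for xi, gi in zip(x, g))
    smooth = sum((a - b) ** 2 for a, b in zip(x, x[1:]))
    return data + lam * smooth

def minimize_chain(g, lam):
    """Global minimizer over x in {-1, 1}^n by dynamic programming."""
    labels = (-1, 1)
    # cost[s]: lowest energy of a prefix whose last pixel carries label s
    cost = {s: (s - g[0]) ** 2 for s in labels}
    back = []
    for gi in g[1:]:
        new_cost, choice = {}, {}
        for s in labels:
            prev = min(labels, key=lambda p: cost[p] + lam * (p - s) ** 2)
            choice[s] = prev
            new_cost[s] = cost[prev] + lam * (prev - s) ** 2 + (s - gi) ** 2
        back.append(choice)
        cost = new_cost
    s = min(labels, key=lambda t: cost[t])
    best = cost[s]
    x = [s]
    for choice in reversed(back):       # trace the optimal labels backwards
        s = choice[s]
        x.append(s)
    x.reverse()
    return x, best

# Check against exhaustive search on a short random signal.
random.seed(0)
g = [random.uniform(-1, 1) for _ in range(10)]
x_star, j_star = minimize_chain(g, lam=0.5)
j_brute = min(energy(x, g, 0.5) for x in itertools.product((-1, 1), repeat=10))
```

The run time grows linearly with the signal length, which is why ground truth is cheap to obtain in one dimension.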

In what follows, we denote with x* the global minimizer of (1), and with x 
the suboptimal solution reconstructed from the solution X to the convex pro- 
gramming problem (5), (6). 
5.1 Ground-Truth Experiments: Partitioning of Random Signals 

For the first series of experiments, we generated 1000 random signals, each with 
256 pixel values uniformly distributed in the range $[-1, 1]$. Figure 4, top, shows an 
example. 

To investigate the performance of the relaxed problem we compare the global 
optimum with the results from the relaxed problem. The optimal value of the 
objective function is bounded as follows: 

$$ \inf_{X \in \mathcal{K},\ D(X) = I} L \bullet X \;\le\; J(x^*) \;\le\; J(x) . \tag{7} $$

The left inequality holds true due to the relaxation of problem (1), as described 
in Section 3. The right inequality is obvious because x* is the global minimizer. 

To evaluate this relationship numerically, we used the following quantities: 

$J^*$: the sample mean of the global optimum $J^* = J(x^*)$ of the functional (2) 
(computed with dynamic programming), 

$\Delta J$: the sample mean of the gap $\Delta J = J - J^*$ (measured in % of the optimum) 
with respect to the objective function values of the suboptimal solution 
$J(x)$ and the optimal solution $J^*$, and 

$\sigma_{\Delta J}$: the sample standard deviation of the gap $\Delta J$. 

The resulting values of these quantities are shown in Figure 3, for different 
values of the global parameter $\lambda$ (1000 random signals were generated for each 
value of $\lambda$). Figure 3 shows that for reasonable values of $\lambda$, the gap $\Delta J$ is about 
5% of the optimal value of the objective function. 

Taking into consideration that these suboptimal solutions can be computed 
by solving a mathematically much simpler convex optimization problem, the 
quality of these solutions is surprisingly good! 

The purely random signals considered in this section exhibit another property: 
There are many solutions having similar values of the objective function 
which, however, differ considerably with respect to the Hamming distance. Figure 
4 illustrates this fact for an arbitrary random signal and a solution pair $x, x^*$ 
leading to a gap of $\Delta J = 6.4\%$, but differing at 58 pixel positions (= 22.7%). 

On the other hand, no spatial context can be exploited for pure random 
signals. Accordingly, there is no meaningful value of the parameter $\lambda$ which could give 
a more accurate solution. Therefore, this negative effect should not be taken too 
seriously because it disappears as soon as the input signal exhibits more structure, 
as is the case for real signals. This will be confirmed in the following sections. 



5.2 Ground-Truth Experiments: Restoration of Noisy Signals Comprising Multiple Scales 

In our second series of experiments, we took the synthetic signal $x'$ depicted in 
Figure 5 which involves transitions at multiple spatial scales, and superimposed 
Gaussian white noise with standard deviation $\sigma = 1.0$. Figure 7, top, shows an 
Fig. 3. Sample mean of the optimal value of the objective function $J^*$, sample mean of 
the corresponding gap $\Delta J$ with respect to the suboptimal solution, and corresponding 
standard deviation $\sigma_{\Delta J}$. On the average, the quality of the suboptimal solution is 
around 5%. 




Fig. 4. Top: A purely random input signal. Middle: The optimal solution $x^*$. 
Bottom: The suboptimal solution $x$. Although the gap is only $\Delta J = 6.4\%$, the Hamming 
distance between these two solutions is not small. This effect is due to missing structure 
of the input signal (see text). 



example. The goal was to restore the synthetic signal from the noisy input 
signals. For each value of $\lambda$, we repeated this experiment 1000 times using different 
noise signals. 

In addition to the measures introduced in the last section, we computed the 
following quantities: 

$\Delta J'$: the sample mean of the gap $\Delta J' = |J - J'|$ (measured in % of the optimum) 
with respect to the objective function values of the suboptimal solution 
$J(x)$ and the synthetic signal $J' = J(x')$, 

$\sigma_{\Delta J'}$: the sample standard deviation of the gap $\Delta J'$. 

The statistics of our numerical results are shown in Figure 6. Two observations 
can be made: First, for values of the scale-parameter $\lambda > 1.5$, the restoration 
is quite accurate: $\Delta J' < 3\%$. Secondly, the fact $\Delta J < \Delta J'$ indicates that 
more appropriate criteria should exist for the restoration of signals that are 
structured like $x'$ (see Fig. 5). The derivation of such functionals is not the objective 
of this paper. However, we point out that such learning problems can probably 
be solved within the general class (1). In that case, our optimization framework 
could be applied, too. 

In order not to overload Figure 6, we did not include the measures $\sigma_{\Delta J'}$ and 
$\sigma_{\Delta J}$. The average values are $\sigma_{\Delta J'} = 3.16\%$ and $\sigma_{\Delta J} = 0.80\%$. These values 
are significantly smaller than those of the previous experiment, and thus they 
confirm the statements made at the end of the last section. 




Fig. 5. Signal x' comprising multiple spatial scales. 



5.3 Ground-Truth Experiment: Real 1D-Signal 

Before turning to two-dimensional signals in the next section, it is quite illus- 
trative to look at numerical results for a real one-dimensional signal, namely 
a column of the noisy image depicted in Figure 1. In Figure 8, top, the noisy 
column of this image is shown. 

The following two plots in Figure 8 show the global minimizer $x^*$ computed 
with dynamic programming, and the suboptimal solution $x$ computed with 
convex programming, respectively, for an appropriate value of the scale-parameter 
$\lambda = 2$. 

This result demonstrates once more the “tightness” of the convex approxi- 
mation of the combinatorial optimization problem (1). 

6 Numerical Experiments: 2D-Images and Grouping 

In the previous section, we showed the performance of the algorithm in the 
context of one-dimensional signals. We will next discuss the results of applying 
the algorithm to two-dimensional images. Computing the global optimum for 
real 2D-signals (images) is no longer possible. To demonstrate the wide range of 
Fig. 6. Average gaps $\Delta J$ and $\Delta J'$ for noisy versions ($\sigma = 1$) of the signal $x'$ shown in 
Fig. 5, and for different values of the scale parameter $\lambda$. According to the dominating 
spatial scales in signal $x'$, for $\lambda > 1.5$ the quality of the restoration is remarkably good 
(below 4%). 




Fig. 7. A representative example illustrating the statistics shown in Fig. 6. Top: Noisy 
input signal. Middle: Optimal Solution x* . Bottom: Suboptimal solution x. 
Fig. 8. Top: Column 28 of the very noisy image shown in Fig. 1. Middle: Optimal 
solution $x^*$ computed by dynamic programming. Bottom: Suboptimal solution $x$ 
computed by convex programming. 



problems that can in principle be tackled by the general approach (1), we also 
include results with respect to a grouping problem (see Figure 2). 

The results concerning the restoration of the real image shown in Figure 1 
are shown in Figure 9. Taking into consideration the quite poor signal-to-noise 
ratio, the quality of the restoration is encouraging. 

Figure 10 shows the same experiment with respect to another image. Note 
that the desired object to be restored comprises structures at both large and 
small spatial scales. Again the restoration result using convex programming is 
surprisingly good. 

Next, Figure 11 shows the well-known checkerboard experiment. As can be 
expected, small errors occur only at corners, that is, at local structures whose 
spatial scale is very small and close to that of the noise. 

Finally, the results of the grouping problem (see section 2.2) are depicted in 
Figure 12. The suboptimal solution computed by convex programming clearly 
separates structure from background, apart from a small number of edgels. The 
presence of these extra edgels however is not caused by our optimization ap- 
proach but is consistent with the chosen similarity measure which fails to label 
them as dissimilar. 

7 Conclusion and Further Work 

In this paper, we introduced a novel optimization technique to the field of image 
processing and computer vision. This technique applies to various energy mini- 
mization problems of mid-level vision, the objective function of which typically 
belongs to the large class of binary quadratic functionals. 
Fig. 9. Arrow and bar real image. (a) Noisy original. (b), (c), (d): Suboptimal 
solutions computed by convex programming for $\lambda = 0.8, 1.5, 3.0$. 




Fig. 10. Iceland image. (a) Binary noisy original. (b) Suboptimal solution computed 
by convex programming with $\lambda = 2.0$. (c) Original before adding noise. 




Fig. 11. Checkerboard image. (a) Noisy original. (b) Suboptimal solution computed 
by convex programming with $\lambda = 1.5$. 



The most important property which distinguishes our approach from related 
work is its mathematical simplicity: Suboptimal solutions can be computed by 
just solving a convex optimization problem. As a consequence, no additional 
tuning parameters related to search heuristics, etc. are needed, apart from the 
parameters of the original model itself, of course. 

For two representative functionals related to image labeling and grouping, 
extensive numerical experiments revealed a surprising quality of suboptimal so- 
lutions with an error below 5% on the average. Due to this fact as well as the 
Fig. 12. Top: Input data (see Section 2.2). Bottom, left: The suboptimal solution 
computed with convex optimization ($\lambda = 0.9$). Four false primitives are included which, 
according to the relative angle measure, cannot be distinguished from object 
primitives. Bottom, right: The true solution. 



clear algorithmic properties of our approach, we consider it as an attractive 
candidate in the context of computational vision. 

We will continue our work as follows: First, we will try to prove bounds with 
respect to the suboptimality of solutions (see Eqn. (7)). Furthermore, we will 
focus on the algorithmic properties in order to exploit sparsity of specific prob- 
lems. Finally, other problems in the general class (1) like matching of relational 
object representations, for example, will be investigated. 
Abstract. Grouping is a global partitioning process that integrates local 
cues distributed over the entire image. We identify four types of 
pairwise relationships, attraction and repulsion, each of which can be 
symmetric or asymmetric. We represent these relationships with two di- 
rected graphs. We generalize the normalized cuts criteria to partitioning 
on directed graphs. Our formulation results in Rayleigh quotients on Her- 
mitian matrices, where the real part describes undirected relationships, 
with positive numbers for attraction, negative numbers for repulsion, 
and the imaginary part describes directed relationships. Globally opti- 
mal solutions can be obtained by eigendecomposition. The eigenvectors 
characterize the optimal partitioning in the complex phase plane, with 
phase angle separation determining the partitioning of vertices and the 
relative phase advance indicating the ordering of partitions. We use di- 
rected repulsion relationships to encode relative depth cues and demon- 
strate that our method leads to simultaneous image segmentation and 
depth segregation. 



1 Introduction 

The grouping problem emerges from several practical applications including im- 
age segmentation, text analysis and data mining. In its basic form, the problem 
consists of extracting from a large number of data points, i.e., pixels, words and 
documents, the overall organization structures that can be used to summarize 
the data. This allows one to make sense of extremely large sets of data. In human 
perception, this ability to group objects and detect patterns is called perceptual 
organization. It has been clearly demonstrated in various perceptual modalities 
such as vision, audition and somatosensation [6]. 

To understand the grouping problem, we need to answer two basic questions: 
1) what is the right criterion for grouping? 2) how to achieve the criterion com- 
putationally? At an abstract level, the criterion for grouping seems to be clear. 
We would like to partition the data so that elements are well related within 
groups but decoupled between groups. Furthermore, we prefer grouping mech- 
anisms that provide a clear organization structure of the data. This means to 
extract big pictures of the data first and then refine them. 



M.A.T. Figueiredo, J. Zerubia, A.K. Jain (Eds.): EMMCVPR 2001, LNCS 2134, pp. 283-297, 2001. 
(c) Springer- Verlag Berlin Heidelberg 2001 
To achieve this goal, a number of computational approaches have been proposed, 
such as clustering analysis through agglomerative and divisive algorithms 
[5], greedy region growing, relaxation labeling [13], Markov random fields (MRF) 
[4] and variational formulations [2,7,9]. While the greedy algorithms are compu- 
tationally efficient, they can only achieve locally optimal solutions. Since group- 
ing is about finding the global structures of the data, they fall short of this 
goal. MRF formulations, on the other hand, provide a global cost function in- 
corporating all local clique potentials evaluated on nearby data points. These 
clique potentials can encode a variety of configuration constraints and probabil- 
ity distributions [18]. One shortcoming of these approaches is a lack of efficient 
computational solutions. 

Recently we have seen a set of computational grouping methods using local 
pairwise relationships to compute global grouping structures [1,3,12,11,14,15,16]. 
These methods share a similar goal of grouping with MRF approaches, but they 
have efficient computational solutions. It has been demonstrated that they work 
successfully in the segmentation of complex natural images [8]. 

However, these grouping approaches are somewhat handicapped by the very 
representation that makes them computationally tractable. For example, in 
graph formulations [16,15,3,11], negative correlations are avoided because nega- 
tive edge weights are problematic for most graph algorithms. In addition, asym- 
metric relationships such as those that arise from figure-ground cues in image 
segmentation and web-document connections in data mining cannot be con- 
sidered because of the difficulty in formulating a global criterion with efficient 
solutions. 

In this paper, we develop a grouping method in the graph framework that 
incorporates pairwise negative correlation as well as asymmetric relationships. 
We propose a representation in which all possible pairwise relationships are char- 
acterized in two types of directed graphs, each encoding positive and negative 
correlations between data points. We generalize the dual grouping formulation 
of normalized cuts and associations to capture directed grouping constraints. 
We show that globally optimal solutions can be obtained by solving generalized 
eigenvectors of Hermitian weight matrices in the complex domain. The real and 
imaginary parts of Hermitian matrices encode undirected and directed relation- 
ships respectively. The phase angle separation defined by the eigenvectors in the 
complex plane determines the partitioning of data points, and the relative phase 
advance indicates the ordering of partitions. 

The rest of the paper is organized as follows. Section 2 gives a brief review of 
segmentation with undirected graphs in the normalized cuts formulation. Section 
3 expands our grouping method in detail. Section 4 illustrates our ideas and 
methods on synthetic data. Section 5 concludes the paper. 

2 Review on Grouping on One Undirected Graph 

The key principles of grouping can often be illustrated in the context of image 
segmentation. In graph methods for image segmentation, an image is described 
by an undirected weighted graph G = (V, E, W), where each pixel is a vertex 
in V and the likelihood of two pixels belonging to one group is described by a 
weight in W associated with the edge in E between two vertices. The weights 
are computed from a pairwise similarity function of image attributes such as 
intensity, color and motion profiles. Such similarity relationships are symmetric 
and can be considered as mutual attraction between vertices. 

After an image is transcribed into a graph, image segmentation becomes a 
vertex partitioning problem. A good segmentation is the optimal partitioning 
scheme according to some partitioning energy functions, evaluating how heav- 
ily each group is internally connected (associations) and/or how weakly those 
between-group connections (cuts) are. We are particularly interested in the nor- 
malized associations and cuts criteria [15], for they form a duality pair such that 
the maximization of associations automatically leads to the minimization of cuts 
and vice versa. 

A vertex bipartitioning $(V_1, V_2)$ on graph $G = (V, E)$ has $V = V_1 \cup V_2$ and 
$V_1 \cap V_2 = \emptyset$. Given weight matrix $W$ and two vertex sets $P$ and $Q$, let $C_W(P, Q)$ 
denote the total $W$ connections from $P$ to $Q$, 

$$ C_W(P, Q) = \sum_{j \in P,\, k \in Q} W(j, k) . $$

In particular, $C_W(V_1, V_2)$ is the total weights cut by the bipartitioning, whereas 
$C_W(V_l, V_l)$ is the total association among vertices in $V_l$, $l = 1, 2$. Let $\mathcal{D}_W(P)$ 
denote the total outdegree of $P$, 

$$ \mathcal{D}_W(P) = C_W(P, V) , $$

which is the total weights connected to all vertices in a set $P$. Let $S_W(P, Q)$ 
denote the connection ratio from $P$ to $Q$, 

$$ S_W(P, Q) = \frac{C_W(P, Q)}{\mathcal{D}_W(P)} . $$

In particular, $S_W(V_l, V_l)$ is called the normalized association of vertex set $V_l$ as it 
is the association normalized by its degree of connections. Likewise, $S_W(V_1, V_2)$ 
is called the normalized cuts between $V_1$ and $V_2$. The sums of these ratios 
over the two partitions are denoted by 

$$ \epsilon_a = \sum_{l=1}^{2} S_W(V_l, V_l) , \qquad \epsilon_c = \sum_{l=1}^{2} S_W(V_l, V \setminus V_l) . $$

$\epsilon_a$ and $\epsilon_c$ are called normalized associations and cuts criteria. Since $\forall l$, 
$S_W(V_l, V_l) + S_W(V_l, V \setminus V_l) = 1$, we have $\epsilon_a + \epsilon_c = 2$; thus $\epsilon_a$ and $\epsilon_c$ are dual criteria: maximizing 
$\epsilon_a$ is equivalent to minimizing $\epsilon_c$. We seek the optimal solution maximizing $\epsilon_a$ 
such that within-group associations are maximized and between-group cuts are 
minimized. 
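
The definitions above translate directly into a few lines of code. The sketch below uses a small, hypothetical symmetric weight matrix; the helper names `connections` and `connection_ratio` are illustrative and not taken from the paper.

```python
import numpy as np

def connections(W, P, Q):
    """C_W(P, Q): total weight from vertex set P to vertex set Q."""
    return float(W[np.ix_(P, Q)].sum())

def connection_ratio(W, P, Q):
    """S_W(P, Q) = C_W(P, Q) / D_W(P), with D_W(P) = C_W(P, V)."""
    everything = list(range(W.shape[0]))
    return connections(W, P, Q) / connections(W, P, everything)

# Hypothetical symmetric weights: two tight pairs joined by weak links.
W = np.array([[0, 5, 1, 0],
              [5, 0, 0, 1],
              [1, 0, 0, 4],
              [0, 1, 4, 0]], dtype=float)
V1, V2 = [0, 1], [2, 3]

# Normalized associations and normalized cuts of the bipartitioning.
eps_a = connection_ratio(W, V1, V1) + connection_ratio(W, V2, V2)
eps_c = connection_ratio(W, V1, V2) + connection_ratio(W, V2, V1)
```

For any bipartitioning with nonzero degrees the two values add up to 2, which is the duality stated above.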

The above criteria can be written as Rayleigh quotients of partitioning 
variables. Let $X_l$ be a membership indicator vector for group $l$, $l = 1, 2$, where $X_l(j)$ 
assumes 1 if vertex $j$ belongs to group $l$ and 0 otherwise. Let $D_W$ be the diagonal 
degree matrix of the weight matrix $W$, $D_W(j, j) = \sum_k W_{jk}$, $\forall j$. Let $1$ denote 
the all-one vector. Let $k$ denote the degree ratio of $V_1$: $k = \frac{X_1^T D_W X_1}{1^T D_W 1}$. We define 
$y = (1 - k) X_1 - k X_2$. Therefore, the optimization problem becomes: 

$$ \max \epsilon_a = \frac{y^T W y}{y^T D_W y} + 1 ; \qquad \min \epsilon_c = \frac{y^T (D_W - W) y}{y^T D_W y} ; $$

$$ \text{s.t. } y^T D_W 1 = 0 ; \qquad y_j \in \{ 1 - k, -k \},\ \forall j . $$

When the discreteness constraint is relaxed, the second largest generalized 
eigenvector of $(W, D_W)$ maximizes $\epsilon_a$ subject to the zero-sum constraint 
$y^T D_W 1 = 0$. For the eigensystem $M_1 y = \lambda M_2 y$ of a matrix pair $(M_1, M_2)$, let 
$\lambda(M_1, M_2)$ be the set of distinctive generalized eigenvalues $\lambda$ and $\Upsilon(M_1, M_2, \lambda)$ be the 
eigenspace of $y$. It can be shown that $\forall \lambda \in \lambda(W, D_W)$, $|\lambda| \le 1$. Let $\lambda_k$ denote the 
$k$-th largest eigenvalue, then $\lambda_1 = 1$ and $1 \in \Upsilon(W, D_W, \lambda_1)$. Thus the optimal 
solution is: 

$$ \epsilon_a(y_{opt}) = 1 + \lambda_2 , \qquad y_{opt} \in \Upsilon(W, D_W, \lambda_2) . $$
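
A minimal sketch of the relaxed solution, assuming a small hypothetical weight matrix with positive degrees: the generalized problem $W y = \lambda D_W y$ is reduced to an ordinary symmetric eigenproblem through the substitution $y = D_W^{-1/2} v$. The function name is illustrative.

```python
import numpy as np

def second_generalized_eigenpair(W):
    """Return (lambda_2, y_2, lambda_1) for the pair (W, D_W)."""
    d = W.sum(axis=1)
    d_isqrt = 1.0 / np.sqrt(d)
    # W y = lambda D y  <=>  (D^-1/2 W D^-1/2) v = lambda v,  y = D^-1/2 v
    S = d_isqrt[:, None] * W * d_isqrt[None, :]
    lam, vecs = np.linalg.eigh(S)          # eigenvalues in ascending order
    return float(lam[-2]), d_isqrt * vecs[:, -2], float(lam[-1])

# Hypothetical symmetric weights: two tight pairs joined by weak links.
W = np.array([[0, 5, 1, 0],
              [5, 0, 0, 1],
              [1, 0, 0, 4],
              [0, 1, 4, 0]], dtype=float)

lam2, y, lam1 = second_generalized_eigenpair(W)
groups = y > 0                             # sign of y gives the bipartitioning
eps_a_relaxed = 1.0 + lam2
```

Thresholding the relaxed vector at zero is one simple way to recover a discrete partition; other thresholds are possible and the choice here is only for illustration.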

3 Grouping on Two Directed Graphs 

The above formulation addresses the grouping problem in a context where we can 
estimate the similarity between a pair of pixels. This set of relationships arises 
naturally in color, texture and motion segmentation. However, a richer set of 
pairwise relationships exists in a variety of settings. For example, relative depth 
cues suggest that two pixels should not belong to the same group; in fact, one of 
them is more likely to be figure and the other is then the ground. Compared to the 
similarity measures, this example encapsulates two other distinct attributes in 
pairwise relationships: repulsion and asymmetry. This leads to a generalization of 
the above grouping model in two ways. One is to have dual measures of attraction 
and repulsion, rather than attraction alone; the other is to have directed graph 
partitioning, rather than symmetric undirected graph partitioning. 



3.1 Representation 

We generalize the single undirected graph representation for an image to two 
directed graph representations $G = \{G_A, G_R\}$: $G_A = (V, E_A, A)$, $G_R = (V, E_R, R)$, 
encoding pairwise attraction and repulsion relationships respectively. Both $A$ 
and $R$ are nonnegative weight matrices. Since $G_A$ and $G_R$ are directed, $A$ and $R$ 
can be asymmetric. An example is given in Fig. 1. 

Whereas directed repulsion can capture the asymmetry between figure and 
ground, directed attraction can capture the general compatibility between two 
pixels. For example, a reliable structure at one pixel location might have a higher 
affinity with a structure at another location, meaning the presence of the former 
is more likely to attract the latter to the same group, but not the other way 
around. 
0 6 1 0 0 
2 0 0 1 0 
1 0 0 6 1 
0 1 1 0 0 
0 0 1 2 0 

a. $G_A = (V, E_A, A)$    b. $G_R = (V, E_R, R)$ 

Fig. 1. Two directed graph representation of an image. a. Directed graph with 
nonnegative asymmetric weights for attraction. b. Directed graph with nonnegative 
asymmetric weights for repulsion. 





3.2 Criteria 

To generalize the criteria on nondirectional attraction to directed dual measures 
of attraction and repulsion, we must address three issues. 

1. Attraction vs. repulsion: how do we capture the semantic difference between 
attraction and repulsion? For attraction A, we desire the association within 
groups to be as large as possible; whereas for repulsion R, we ask the segre- 
gation by between-group repulsion to be as large as possible. 

2. Undirected vs. directed: how do we characterize a partitioning that favors 
between-group relationships in one direction but not the other? There are 
two aspects of this problem. The first is that we need to evaluate within-group 
connections regardless of the asymmetry of internal connections. This 
can be done by partitioning based on its undirected version so that within-group 
associations are maximized. The second is that we need to reflect our 
directional bias on between-group connections. The bias favoring weights 
associated with edges pointing from $V_1$ to $V_2$ is introduced by an asymmetric 
term that appreciates connections in $C_W(V_1, V_2)$ but discourages those in 
$C_W(V_2, V_1)$. For these two purposes, we decompose $2W$ into two terms: 

$$ 2W = W_u + W_d , \qquad W_u = (W + W^T) , \qquad W_d = (W - W^T) . $$

$W_u$ is an undirected version of graph $G_W$, where each edge is associated 
with the sum of the $W$ weights in both directions. The total degree of the 
connections for an asymmetric $W$ is measured exactly by the outdegree of 
$W_u$. $W_d$ is a skew-symmetric matrix representation of $W$, where each edge 
is associated with the weight difference of $W$ edges pointing in opposite 
directions. Their links to $W$ are formally stated below: 

$$ C_{W_u}(P, Q) = C_W(P, Q) + C_W(Q, P) = C_{W_u}(Q, P) , $$

$$ C_{W_d}(P, Q) = C_W(P, Q) - C_W(Q, P) = -C_{W_d}(Q, P) . $$

This decomposition essentially turns our original graph partitioning on two 
directed graphs of attraction and repulsion into a simultaneous partitioning 
on four graphs of nondirectional attraction and repulsion, and directional 
attraction and repulsion. 
3. Integration: how do we integrate the partitioning on such four graphs into 
one criterion? We couple connection ratios on these graphs through linear 
combinations. The connection ratios of undirected graphs of $A$ and $R$ are 
first combined by linear weighting with their total degrees of connections. 
The connection ratios of directed graphs are defined by the cuts normalized 
by the geometrical average of the degrees of two vertex sets. The total energy 
function is then the convex combination of two types of connection ratios for 
undirected and directed partitioning, with a parameter $\beta$ determining their 
relative importance. 
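
The decomposition described in item 2 can be illustrated on a small example. The weight matrix below is hypothetical, and `connections` is an illustrative helper that sums the weights from one vertex set to another.

```python
import numpy as np

def connections(W, P, Q):
    """C_W(P, Q): total weight from vertex set P to vertex set Q."""
    return float(W[np.ix_(P, Q)].sum())

# Hypothetical asymmetric weights on three vertices.
W = np.array([[0, 3, 1],
              [1, 0, 2],
              [0, 0, 0]], dtype=float)

W_u = W + W.T      # undirected part: sum of weights in both directions
W_d = W - W.T      # skew-symmetric part: difference of opposite weights

P, Q = [0], [1, 2]
c_pq, c_qp = connections(W, P, Q), connections(W, Q, P)
c_u = connections(W_u, P, Q)   # equals c_pq + c_qp
c_d = connections(W_d, P, Q)   # equals c_pq - c_qp
```

The undirected part is insensitive to swapping $P$ and $Q$, whereas the skew-symmetric part changes sign, which is what allows it to encode an ordering of the two groups.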

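With directed relationships, we seek an ordered bipartitioning $(V_1, V_2)$ such 
that the net directed edge flow from $V_1$ to $V_2$ is maximized. The above 
considerations lead to the following formulation of our criteria. 

$$ \epsilon_a(A, R; \beta) = 2\beta \cdot \sum_{l=1}^{2} \frac{ S_{A_u}(V_l, V_l)\, \mathcal{D}_{A_u}(V_l) + S_{R_u}(V_l, V \setminus V_l)\, \mathcal{D}_{R_u}(V_l) }{ \mathcal{D}_{A_u}(V_l) + \mathcal{D}_{R_u}(V_l) } + 2(1 - \beta) \cdot \frac{ C_{A_d + R_d}(V_1, V_2) - C_{A_d + R_d}(V_2, V_1) }{ \sqrt{ [\mathcal{D}_{A_u}(V_1) + \mathcal{D}_{R_u}(V_1)] \cdot [\mathcal{D}_{A_u}(V_2) + \mathcal{D}_{R_u}(V_2)] } } , $$

$$ \epsilon_c(A, R; \beta) = 2\beta \cdot \sum_{l=1}^{2} \frac{ S_{A_u}(V_l, V \setminus V_l)\, \mathcal{D}_{A_u}(V_l) + S_{R_u}(V_l, V_l)\, \mathcal{D}_{R_u}(V_l) }{ \mathcal{D}_{A_u}(V_l) + \mathcal{D}_{R_u}(V_l) } + 2(1 - \beta) \cdot \frac{ C_{A_d + R_d}(V_2, V_1) - C_{A_d + R_d}(V_1, V_2) }{ \sqrt{ [\mathcal{D}_{A_u}(V_1) + \mathcal{D}_{R_u}(V_1)] \cdot [\mathcal{D}_{A_u}(V_2) + \mathcal{D}_{R_u}(V_2)] } } . $$

Note that the duality between $\epsilon_a$ and $\epsilon_c$ is maintained as $\epsilon_a + \epsilon_c = 4\beta$. 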

For undirected graphs, $S_{A_u}(V_l, V_l)$ is the old normalized association by 
attraction of set $V_l$; $S_{R_u}(V_l, V \setminus V_l)$ is the normalized dissociation by repulsion of set 
$V_l$. They are summed up using weights from their total degrees of connections: 
$\mathcal{D}_{A_u}(V_l)$ and $\mathcal{D}_{R_u}(V_l)$. 

For directed graphs, only the asymmetry of connections matters. We sum up 
the cross connections regardless of attraction and repulsion: $C_{A_d + R_d}(V_1, V_2) - 
C_{A_d + R_d}(V_2, V_1)$, normalized by the geometrical average of the degrees of the two 
involved sets. Similar to $S_W(P, Q)$, this again is a unitless connection ratio. 

We write the partitioning energy as functions of $(A, R)$ to reflect the fact that 
for this pair of directed graphs, we favor both attractive and repulsive edge flow 
from $V_1$ to $V_2$. They can also be decoupled. For example, the ordered partitioning 
based on $\epsilon_a(A^T, R; \beta)$ favors repulsion flow from $V_1$ to $V_2$, but attraction flow 
from $V_2$ to $V_1$. 

Finally, we sum up the two terms for undirected and directed relationships 
by their convex combination, with the parameter $\beta$ determining their relative 
importance. When $\beta = 1$, the partitioning ignores the asymmetry in connection 
weights, while when $\beta = 0$, the partitioning only cares about the asymmetry in 
graph weights. When $\beta = 0.5$, both graphs are considered equally. The factor 2 
is introduced to make sure that the formulas are identical to those in Section 
2 for $A = A^T$ and $R = 0$, i.e., $\epsilon_a(A, 0; 0.5) + \epsilon_c(A, 0; 0.5) = 2$. 
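
As a hedged illustration of the criteria, the sketch below evaluates both energies for a given ordered bipartitioning, using the equivalent form in which each product $S \cdot \mathcal{D}$ is written as the corresponding connection total. The matrices `A` and `R` are hypothetical, and the function names are not from the paper.

```python
import numpy as np

def C(W, P, Q):
    """Total weight from vertex set P to vertex set Q."""
    return float(W[np.ix_(P, Q)].sum())

def criteria(A, R, V1, V2, beta):
    """Return (eps_a, eps_c) for the ordered bipartitioning (V1, V2)."""
    allv = list(range(A.shape[0]))
    Au, Ru = A + A.T, R + R.T
    Wd = (A - A.T) + (R - R.T)              # A_d + R_d
    deg = {}
    und_a = und_c = 0.0
    for key, (P, Q) in {1: (V1, V2), 2: (V2, V1)}.items():
        d = C(Au, P, allv) + C(Ru, P, allv)  # combined degree of the set P
        deg[key] = d
        und_a += (C(Au, P, P) + C(Ru, P, Q)) / d
        und_c += (C(Au, P, Q) + C(Ru, P, P)) / d
    flow = (C(Wd, V1, V2) - C(Wd, V2, V1)) / np.sqrt(deg[1] * deg[2])
    eps_a = 2 * beta * und_a + 2 * (1 - beta) * flow
    eps_c = 2 * beta * und_c - 2 * (1 - beta) * flow
    return eps_a, eps_c

# Hypothetical data: symmetric attraction, repulsion directed from {0,1} to {2,3}.
A = np.array([[0, 4, 0, 0], [4, 0, 1, 0], [0, 1, 0, 4], [0, 0, 4, 0]], dtype=float)
R = np.array([[0, 0, 2, 0], [0, 0, 0, 2], [0, 0, 0, 0], [0, 0, 0, 0]], dtype=float)

a_fwd, c_fwd = criteria(A, R, [0, 1], [2, 3], beta=0.5)
a_rev, c_rev = criteria(A, R, [2, 3], [0, 1], beta=0.5)
```

Swapping the order of the two groups leaves the undirected term unchanged and flips the sign of the directed term, so the ordering in the direction of the repulsion flow obtains the larger value of $\epsilon_a$.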
3.3 Computational Solutions 

It turns out that our criteria lead to Rayleigh quotients of Hermitian 
matrices. Let $i = \sqrt{-1}$. Let $*$ and $H$ denote the conjugate and conjugate transpose 
operators respectively. We define an equivalent degree matrix $D_{eq}$ and 
equivalent Hermitian weight matrix $W_{eq}$, which combines symmetric weight matrix $U$ 
for an equivalent undirected graph and skew-symmetric weight matrix $V$ for an 
equivalent directed graph into one matrix: 

$$ D_{eq} = D_{A_u} + D_{R_u} , \qquad U = 2\beta \cdot (A_u - R_u + D_{R_u}) = U^T , $$

$$ W_{eq} = U + i \cdot V = W_{eq}^H , \qquad V = 2(1 - \beta) \cdot (A_d + R_d) = -V^T . $$

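We then have: 

$$ \epsilon_a = \sum_{l=1}^{2} \frac{X_l^T U X_l}{X_l^T D_{eq} X_l} + \frac{X_1^T V X_2 - X_2^T V X_1}{\sqrt{X_1^T D_{eq} X_1 \cdot X_2^T D_{eq} X_2}} . $$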

We can see clearly what directed relationships provide in the energy terms. 
The first term is for undirected graph partitioning, which measures the 
symmetric connections within groups, while the second term is for directed graph 
partitioning, which measures the skew-symmetric connections between groups. 
Such complementary and orthogonal pairings allow us to write the criterion in 
a quadratic form of one matrix by using complex numbers. Let $k$ denote the degree 
ratio of $V_1$: $k = \frac{X_1^T D_{eq} X_1}{1^T D_{eq} 1}$. We define a complex vector $z$, the square of which 
becomes the real vector we used in the single graph partitioning: 

$$ z = \sqrt{1 - k}\, X_1 - i \cdot \sqrt{k}\, X_2 , \qquad z^2 = (1 - k) X_1 - k X_2 . $$

It can be verified that: 

$$ \epsilon_a = 2\, \frac{z^H W_{eq} z}{z^H D_{eq} z} , \qquad \epsilon_c = 2\, \frac{z^H (2\beta D_{eq} - W_{eq}) z}{z^H D_{eq} z} , $$

subject to the zero-sum constraint of $(z^2)^T D_{eq} 1 = 0$. Ideally, a good 
segmentation seeks the solution of the following optimization problem, 

$$ z_{opt} = \arg\max_z \frac{z^H W_{eq} z}{z^H D_{eq} z} , \qquad \text{s.t. } (z^2)^T D_{eq} 1 = 0 , \quad \forall j,\ z_j \in \{ \sqrt{1 - k},\ -i \sqrt{k} \} . $$

The above formulations show that repulsion can be regarded as the extension 
of attraction measures to negative numbers, whereas directed measures com- 
plement undirected measures along an orthogonal dimension. This generalizes 
graph partitioning on a nonnegative symmetric weight matrix to an arbitrary 
Hermitian weight matrix. 

We find an approximate solution by relaxing the discreteness and zero-sum
constraints. $W_{eq}$ being Hermitian guarantees that when $z$ is relaxed to
take any complex values, $\epsilon_a$ is always a real number. It can be shown
that $\forall \lambda \in \lambda(W_{eq}, D_{eq})$, $|\lambda| \le 3$, and

$\epsilon_a(z_{opt}) = 2\lambda_1$, $z_{opt} = z_1$.

As all eigenvalues of a Hermitian matrix are real, the eigenvalues can still be
ordered by size rather than by magnitude. Because $1 \in T(W_{eq}, D_{eq},
\lambda_1)$ when and only when $R = 0$ and $A_d = 0$, the zero-sum constraint
$(z^2)^T D_{eq} 1 = 0$ is not, in general, automatically satisfied.



3.4 Phase Plane Embedding of an Ordered Partitioning 

In order to understand how an ordered partitioning is encoded in the above
model, we need to study the labeling vector $z$. We illustrate the ideas in the
language of figure-ground segregation. If we consider $R$ encoding relative
depths with $R_d(j, k) > 0$ for $j$ in front of $k$, the ordered partitioning
based on $\epsilon_a(A, R; \beta)$ identifies $V_1$ as a group in front
(figure) and $V_2$ as a group in the back (ground).

There are two properties of $z$ that are relevant to partitioning: magnitudes
and phases. For a complex number $c = a + i\,b$, where $a$ and $b$ are both
real, its magnitude is defined to be $|c| = \sqrt{a^2 + b^2}$ and its phase is
defined to be the angle of point $(a, b)$ in a 2D plane: $\angle c =
\arctan\frac{b}{a}$. As $z = \sqrt{1-k}\,X_1 - i\sqrt{k}\,X_2$, where $k$ is the
degree ratio of the figure, the ideal solution assigns the real number
$\sqrt{1-k}$ to figure and the imaginary number $-i\sqrt{k}$ to ground.
Therefore, the magnitudes of elements in $z$ indicate sizes of partitions: the
larger the magnitude of $z_j$, the smaller the connection ratio of its own
group; whereas the relative phases indicate the figure-ground relationships:
$\angle z_j - \angle z_k = 0°$ means that $j$ and $k$ are in the same group,
90° (phase advance) for $j$ in front of $k$, -90° (phase lag) for $j$ behind
$k$. This interpretation remains valid when $z$ is scaled by any complex number
$c$. Therefore, the crucial partitioning information is captured in the phase
angles of $z$ rather than the magnitudes, as the magnitudes can become
uninformative when the connection ratios of the two partitions are the same.
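The phase interpretation can be checked numerically on the ideal labeling
vector. The sketch below is illustrative only: the function names are
hypothetical, the vertex degrees are taken as the row sums of the symmetric
attraction and repulsion weights of the five-vertex example in Section 4, and
the scaling constant is an arbitrary choice.

```python
import cmath
import math

def ideal_labels(degrees, figure):
    """Ideal labeling z = sqrt(1-k) X1 - i sqrt(k) X2 for a given figure set."""
    total = sum(degrees)
    k = sum(degrees[j] for j in figure) / total  # degree ratio of the figure
    z = [complex(math.sqrt(1 - k), 0.0) if j in figure
         else complex(0.0, -math.sqrt(k))
         for j in range(len(degrees))]
    return k, z

def phase_advance_deg(zj, zk):
    """Phase of zj relative to zk in degrees, in the range [-180, 180]."""
    return math.degrees(cmath.phase(zj * zk.conjugate()))

# Assumed degrees (diagonal of D_eq) for the five-vertex example,
# with vertices {1, 3, 5} as figure (0-based indices 0, 2, 4).
degrees = [37, 25, 35, 26, 31]
figure = {0, 2, 4}
k, z = ideal_labels(degrees, figure)

# An arbitrary complex scaling; relative phases should not change.
c = 2.0 * cmath.exp(0.7j)
scaled = [c * zj for zj in z]

print(phase_advance_deg(z[0], z[1]))            # figure vs ground
print(phase_advance_deg(scaled[0], scaled[1]))  # unchanged by scaling
print(phase_advance_deg(z[0], z[2]))            # two figure vertices

# Squared labels are (1 - k) on the figure and -k on the ground, so the
# degree-weighted sum of z^2 vanishes when k is the figure's degree ratio.
zero_sum = sum(d * (zj * zj) for d, zj in zip(degrees, z))
print(abs(zero_sum))
```

Working with the product of one label and the conjugate of the other gives the
phase difference directly and avoids wrap-around issues when subtracting two
angles.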

When the elements of $z$ are squared, we get $z^2 = (1-k)X_1 - kX_2$. Two
groups become antiphase (180°) in $z^2$ labels. Though the same partitioning
remains, the figure-ground information could be lost in $cz^2$ for constant
scaling on $z$. This fact is most obvious when $A_d + R_d = 0$, where both $z$
and $z^*$ correspond to the same partitioning energy $\epsilon_a$. This pair of
solutions suggests two possibilities: $V_1$ is figure or ground. In other
words, the ordering of partitions is created by directed graphs. When we do not
care about the direction, $z^2$ contains the necessary information for
partitioning. Indeed, we can show that

$$\epsilon_a = \frac{(z^2)^T W_{eq}\, z^2}{(z^2)^T D_{eq}\, z^2} + \frac{1^T W_{eq} 1}{1^T D_{eq} 1}.$$

Note that $W_{eq}$ now becomes a real symmetric matrix.

The phase-plane partitioning remains valid in the relaxed solution space. Let
$W = D_{eq}^{-1/2} W_{eq} D_{eq}^{-1/2}$, the eigenvectors of which are
equivalent (related by $D_{eq}^{1/2}$) to those of $(W_{eq}, D_{eq})$. Let $U$
and $V$ denote the real and imaginary parts of $W$: $W = U + iV$, where $U$ is
symmetric and $V$ is skew-symmetric. We consider $U_{jk}$ (the net effect of
attraction $A$ and repulsion $R$) as repulsion if it is negative, and as
attraction otherwise. For any vector $z$, we have:

$$z^H W z = \sum_{j,k} |z_j| \cdot |z_k| \cdot \bigl(U_{jk}\cos(\angle z_j - \angle z_k) + V_{jk}\sin(\angle z_j - \angle z_k)\bigr)$$

$$= 2\sum_{j<k} |z_j| \cdot |z_k| \cdot |W_{jk}| \cdot \cos(\angle z_j - \angle z_k - \angle W_{jk}) + \sum_j |z_j|^2 \cdot W_{jj}.$$

We see that $z^H W z$ is maximized when $\angle z_j - \angle z_k$ matches
$\angle W_{jk}$. Therefore, attraction encourages a phase difference of 0°,
repulsion encourages a phase difference of 180°, and directed edge flow
encourages a phase difference of 90°. The optimal solution results from a
trade-off between these three processes. If $V_{jk} > 0$ means that $j$ is
figural, then the optimal solution tends to have $\angle z_j > \angle z_k$
(phase advance less than 90°) if $U_{jk}$ is attraction, but phase advance more
than 90° if it is repulsion. Hence, when there is pairwise repulsion, the
relaxed solution in the continuous domain no longer has the ideal bimodal
vertex valuation and as a result the zero-sum constraint cannot be satisfied.
Nevertheless, phase advance still indicates figure-to-ground relationships.
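The decomposition of the quadratic form into magnitudes and phase mismatches
can be verified numerically. The following is a minimal sketch: the 3x3
Hermitian matrix and the vector are made-up values chosen only for
illustration, and the function names are hypothetical.

```python
import cmath
import math

def quadratic_form(W, z):
    """Direct evaluation of z^H W z."""
    n = len(z)
    return sum(z[j].conjugate() * W[j][k] * z[k]
               for j in range(n) for k in range(n))

def phase_form(W, z):
    """The same quantity written with magnitudes and phase mismatches."""
    n = len(z)
    total = sum(abs(z[j]) ** 2 * W[j][j].real for j in range(n))
    for j in range(n):
        for k in range(j + 1, n):
            mismatch = (cmath.phase(z[j]) - cmath.phase(z[k])
                        - cmath.phase(W[j][k]))
            total += 2 * abs(z[j]) * abs(z[k]) * abs(W[j][k]) * math.cos(mismatch)
    return total

def pair_term(w_jk, delta):
    """Contribution of one pair with unit magnitudes and phase difference delta."""
    return abs(w_jk) * math.cos(delta - cmath.phase(w_jk))

# Made-up Hermitian matrix: real part symmetric, imaginary part skew-symmetric.
W = [[0.5 + 0j, 0.3 + 0.4j, -0.2 + 0j],
     [0.3 - 0.4j, 0.1 + 0j, 0.0 + 0.6j],
     [-0.2 + 0j, 0.0 - 0.6j, 0.7 + 0j]]
z = [1.0 + 0.5j, -0.3 + 0.8j, 0.2 - 1.1j]

direct = quadratic_form(W, z)
via_phases = phase_form(W, z)
print(direct, via_phases)

# Each kind of edge is best served by a different phase difference.
print(pair_term(1 + 0j, 0.0))           # attraction: 0 degrees
print(pair_term(-1 + 0j, math.pi))      # repulsion: 180 degrees
print(pair_term(1j, math.pi / 2))       # directed edge: 90 degrees
```

The three `pair_term` calls each attain the maximum value of the cosine, which
is the sense in which attraction, repulsion and directed flow pull the phase
difference toward 0°, 180° and 90° respectively.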



3.5 Algorithm 

The complete algorithm is summarized below. Given attraction measure $A$ and
repulsion $R$, we try to find an ordered partitioning $(V_1, V_2)$ to maximize
$\epsilon_a(A, R; \beta)$.

Step 1: $A_u = A + A^T$, $A_d = A - A^T$; $R_u = R + R^T$, $R_d = R - R^T$.

Step 2: $D_{eq} = D_{A_u} + D_{R_u}$.

Step 3: $W_{eq} = 2\beta \cdot (A_u - R_u + D_{R_u}) + i \cdot 2(1-\beta) \cdot (A_d + R_d)$.

Step 4: Compute the eigenvectors of $(W_{eq}, D_{eq})$.

Step 5: Find a discrete solution by partitioning eigenvectors in the phase plane.
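Steps 1 to 3 can be sketched in a few lines of plain Python. This is an
illustration rather than the authors' implementation: the function names are
hypothetical, and the matrices `A` and `R` below are one assumed choice of
directed weights whose symmetric and skew-symmetric parts reproduce the
five-vertex example of Section 4 (the actual edge weights of Fig. 1 are not
given in the text). Steps 4 and 5 need a Hermitian eigensolver and are not
shown.

```python
def transpose(M):
    """Transpose of a square matrix stored as a list of rows."""
    return [list(row) for row in zip(*M)]

def equivalent_matrices(A, R, beta):
    """Steps 1-3: return (deq, Weq), where deq is the diagonal of D_eq."""
    n = len(A)
    At, Rt = transpose(A), transpose(R)
    # Step 1: symmetric (undirected) and skew-symmetric (directed) parts.
    Au = [[A[j][k] + At[j][k] for k in range(n)] for j in range(n)]
    Ad = [[A[j][k] - At[j][k] for k in range(n)] for j in range(n)]
    Ru = [[R[j][k] + Rt[j][k] for k in range(n)] for j in range(n)]
    Rd = [[R[j][k] - Rt[j][k] for k in range(n)] for j in range(n)]
    # Step 2: equivalent degrees are row sums of Au plus row sums of Ru.
    dR = [sum(row) for row in Ru]
    deq = [sum(Au[j]) + dR[j] for j in range(n)]
    # Step 3: Hermitian weight matrix, real part undirected, imaginary directed.
    Weq = [[complex(2 * beta * (Au[j][k] - Ru[j][k] + (dR[j] if j == k else 0)),
                    2 * (1 - beta) * (Ad[j][k] + Rd[j][k]))
            for k in range(n)] for j in range(n)]
    return deq, Weq

# Assumed directed weights (rows point from vertex j to vertex k).
A = [[0, 0, 6, 2, 6],
     [0, 0, 0, 6, 1],
     [6, 0, 0, 0, 6],
     [1, 6, 0, 0, 0],
     [6, 2, 6, 0, 0]]
R = [[0, 6, 1, 0, 0],
     [2, 0, 0, 1, 0],
     [1, 0, 0, 6, 1],
     [0, 1, 1, 0, 0],
     [0, 0, 1, 2, 0]]

deq, Weq = equivalent_matrices(A, R, beta=0.5)
print(deq)
for row in Weq:
    print(row)
```

Storing only the diagonal of the degree matrix keeps the sketch short; a full
implementation would hand `Weq` and `deq` to a generalized Hermitian
eigensolver for Step 4.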



4 Results 

We first illustrate our ideas and methods using the simple example in Fig. 1. The 
two directed graphs are decomposed into a symmetric part and a skew-symmetric 
part (Fig. 2). 

This example has a clear division of figure as {1,3,5} and ground as {2,4}
because: within-group connections are stronger for nondirectional attraction
$A_u$; between-group connections are stronger for nondirectional repulsion
$R_u$; and there are only between-group connections pointing from figure to
ground for both directional attraction $A_d$ and directional repulsion $R_d$.

$$A_u = \begin{bmatrix} 0 & 0 & 12 & 3 & 12 \\ 0 & 0 & 0 & 12 & 3 \\ 12 & 0 & 0 & 0 & 12 \\ 3 & 12 & 0 & 0 & 0 \\ 12 & 3 & 12 & 0 & 0 \end{bmatrix}, \qquad A_d = \begin{bmatrix} 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & -1 \\ 0 & 0 & 0 & 0 & 0 \\ -1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \end{bmatrix}$$

a. $G_{A_u} = (V, E_{A_u}, A_u)$        b. $G_{A_d} = (V, E_{A_d}, A_d)$

$$R_u = \begin{bmatrix} 0 & 8 & 2 & 0 & 0 \\ 8 & 0 & 0 & 2 & 0 \\ 2 & 0 & 0 & 7 & 2 \\ 0 & 2 & 7 & 0 & 2 \\ 0 & 0 & 2 & 2 & 0 \end{bmatrix}, \qquad R_d = \begin{bmatrix} 0 & 4 & 0 & 0 & 0 \\ -4 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 5 & 0 \\ 0 & 0 & -5 & 0 & -2 \\ 0 & 0 & 0 & 2 & 0 \end{bmatrix}$$

c. $G_{R_u} = (V, E_{R_u}, R_u)$        d. $G_{R_d} = (V, E_{R_d}, R_d)$

Fig. 2. Decomposition of directed graphs in Fig. 1. a. Nondirectional
attraction $A_u$. b. Directional attraction $A_d$. c. Nondirectional repulsion
$R_u$. d. Directional repulsion $R_d$.



The equivalent degree matrix and weight matrix for $\beta = 0.5$ are:

$$D_{eq} = \begin{bmatrix} 37 & 0 & 0 & 0 & 0 \\ 0 & 25 & 0 & 0 & 0 \\ 0 & 0 & 35 & 0 & 0 \\ 0 & 0 & 0 & 26 & 0 \\ 0 & 0 & 0 & 0 & 31 \end{bmatrix},$$

$$W_{eq} = \begin{bmatrix} 10 & -8 & 10 & 3 & 12 \\ -8 & 10 & 0 & 10 & 3 \\ 10 & 0 & 11 & -7 & 10 \\ 3 & 10 & -7 & 11 & -2 \\ 12 & 3 & 10 & -2 & 4 \end{bmatrix} + i \cdot \begin{bmatrix} 0 & 4 & 0 & 1 & 0 \\ -4 & 0 & 0 & 0 & -1 \\ 0 & 0 & 0 & 5 & 0 \\ -1 & 0 & -5 & 0 & -2 \\ 0 & 1 & 0 & 2 & 0 \end{bmatrix}.$$



We expect that the first eigenvector of $(W_{eq}, D_{eq})$ on {1, 3, 5} has
phase advance with respect to {2, 4}. This is verified in Fig. 3.

Fig. 4a shows how attraction and repulsion complement each other and how their
interaction gives a better segmentation. We use spatial proximity for
attraction. Since the intensity similarity is not considered, we cannot
possibly segment this image with attraction alone. Repulsion is determined by
relative depths suggested by the T-junction at the center. The repulsion
strength falls off exponentially along the direction perpendicular to the
T-arms. We can see that repulsion pushes two regions apart at the boundary,
while attraction carries this force further to the interior of each region
thanks to its transitivity (Fig. 4b). Real image segmentation with T-junctions
can be found in [17].

Since nondirectional repulsion is a continuation of attraction measures into
negative numbers, we calculate the affinity between two $d$-dimensional
features using a Mexican hat function of their difference. It is implemented as
the difference of two Gaussian functions:

$$h(X; \Sigma_1, \Sigma_2) = g(X; 0, \Sigma_1) - g(X; 0, \Sigma_2),$$

$$g(X; \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(X-\mu)^T \Sigma^{-1} (X-\mu)\right),$$

Fig. 3. Partitioning of eigenvectors in the phase-plane. Here we plot the first
two eigenvectors of $(W_{eq}, D_{eq})$ for the example in Fig. 1. The points of
the first (second) eigenvector are marked in circles (squares). Both of their
phases suggest partitioning the five vertices into {1, 3, 5} and {2, 4} but
with opposite orderings. The first chooses {1, 3, 5} as figure for it advances
{2, 4} by about 120°, while the second chooses it as ground for it lags {2, 4}
by about 60°. The first scheme has a much larger partitioning energy
$\epsilon_a$ as indicated by the eigenvalues.



where $\Sigma$'s are $d \times d$ covariance matrices. The evaluation signals
pairwise attraction if positive, repulsion if negative and neutral if zero.
Assuming $\Sigma_2 = \gamma^2 \Sigma_1$, we can calculate two critical radii,
$r_0$, where affinity changes from attraction to repulsion, and $r_-$, where
affinity is maximum repulsion:

$$r_0(\gamma, d) = \sqrt{\frac{2d\ln\gamma}{1-\gamma^{-2}}}, \qquad r_-(\gamma, d) = \sqrt{\frac{2}{d}+1} \cdot r_0(\gamma, d).$$

The case of d = 1 is illustrated in Fig. 5. With this simple change from Gaussian 
functions [15,8,11] measuring attraction to Mexican hat functions measuring 
both attraction and repulsion, we will show that negative weights play a very 
effective role in graph partitioning. 
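The difference-of-Gaussians affinity is easy to evaluate when the covariance
is isotropic. The sketch below assumes that simplification (standard
deviations `sigma` and `gamma * sigma` in every dimension), which the text does
not require; the critical radii are derived here from that assumption by
setting the function and its radial derivative to zero. Function names and the
numerical values in the usage lines are illustrative.

```python
import math

def gaussian(r, sigma, d):
    """Isotropic d-dimensional Gaussian density at distance r from the mean."""
    norm = (2 * math.pi * sigma ** 2) ** (-d / 2)
    return norm * math.exp(-r ** 2 / (2 * sigma ** 2))

def mexican_hat(r, sigma, gamma, d=1):
    """Narrow Gaussian minus a wider one with standard deviation gamma*sigma."""
    return gaussian(r, sigma, d) - gaussian(r, gamma * sigma, d)

def critical_radii(sigma, gamma, d=1):
    """Radius of zero affinity (r0) and radius of strongest repulsion (r_minus)."""
    scale = 2 * math.log(gamma) / (1 - gamma ** -2)
    r0 = sigma * math.sqrt(d * scale)
    r_minus = sigma * math.sqrt((d + 2) * scale)
    return r0, r_minus

sigma, gamma = 0.1, 3.0
r0, r_minus = critical_radii(sigma, gamma, d=1)
print(r0, r_minus)
print(mexican_hat(0.0, sigma, gamma))      # positive: attraction
print(mexican_hat(r0, sigma, gamma))       # about zero: neutral
print(mexican_hat(r_minus, sigma, gamma))  # negative: strongest repulsion
```

Feature differences smaller than `r0` therefore count as attraction and larger
ones as repulsion, with the repulsion fading again far beyond `r_minus`.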

Fig. 6 shows three objects ordered in depth. We compute pairwise affinity 
based on proximity and intensity similarity. We see that partitioning with at- 
traction measures finds a dominant group by picking up the object of the highest 
contrast; with the additional repulsion measures, all objects against a common 
background are grouped together. If we add in directional repulsion measures 
based on occlusion cues, the three objects are further segregated in depth. 

Unlike attraction, repulsion is not an equivalence relationship as it is not
transitive. If object 3 is in front of object 2, which is in front of object 1,
object 3 is not necessarily in front of object 1. In fact, the conclusion we
can draw from the phase plot in Fig. 6 is that when relative depth cues between
object 3 and 1 are missing, object 1 is in front of object 3 instead. When
there are multiple objects in an image, the generalized eigenvectors
subsequently give multiple hypotheses about their relative depths, as shown in
Fig. 7.

Fig. 4. Interaction of attraction and repulsion. a) The first row shows the
image and the segmentation results with attraction A (the second eigenvector),
repulsion R, both A and R (the first eigenvectors). The 2nd and 3rd rows are
the attraction and repulsion fields at the four locations indicated by the
markers in the image. The attraction is determined by proximity, so it is the
same for all four locations. The repulsion is determined by the T-junction at
the center. Most repulsion is zero, while pixels of lighter (darker) values are
in front of (behind) the pixel under scrutiny. The attraction result is not
indicative at all since no segmentation cues are encoded in attraction.
Repulsion only makes boundaries stand out; while working with the
non-informative attraction, the segmentation is carried over to the interiors
of regions. b) Figure-ground segregation upon directional repulsion. Here are
the phase plots of the first eigenvectors for R and A, R. The numbers in the
circles correspond to those in the image shown in a). We rotate the eigenvector
for A, R so that the right-lower corner of the image gets phase 0°. Both cases
give the correct direction at boundaries. However, only with A and R together
are all image regions segmented appropriately. The attraction also reduces the
figure-to-ground phase advance from 135° to 30°.

These examples illustrate that partitioning with directed relationships can 
automatically encode border ownerships [10] in the phase plane embedding. 

Fig. 5. Calculate pairwise affinity using Mexican hat functions based on
difference of Gaussians. When two features are identical, it has maximum
attraction; when feature difference is $r_0$, it is neutral; when feature
difference is $r_-$, it has maximum repulsion.

a. Image. b. Result with A. c. Result with A and R. d. Result with directed R:
angle, e. magnitude.

Fig. 6. The distinct roles of repulsion in grouping. a) 31 x 31 image. The
background and three objects are marked from 0 to 3. They have average
intensity values of 0.6, 0.9, 0.2 and 0.9. Gaussian noise with standard
deviation of 0.03 is added to the image. Object 2 has slightly higher contrast
against background than objects 1 and 3. Attraction and nondirectional
repulsion are measured by Mexican hat functions of pixel distance and intensity
difference with $\sigma$'s of 10 and 0.1 respectively. The neighborhood radius
is 3 and $\gamma = 3$. b) Segmentation result with attraction alone. c)
Segmentation result with both attraction and repulsion. d), e) and f) show the
result when directional repulsion based on relative depth cues at T-junctions
is incorporated. With nondirectional repulsion, objects that repel a common
ground are bound together in one group. With directional repulsion, objects can
be further segregated in depth.

Image. $T(W_{eq}, D_{eq}, \lambda_1)$ $T(W_{eq}, D_{eq}, \lambda_2)$ $T(W_{eq}, D_{eq}, \lambda_3)$

Fig. 7. Depth segregation with multiple objects. Each row shows an image and
the phase plots of the first three eigenvectors obtained on cues of proximity,
intensity similarity and relative depths. All four objects have the same degree
of contrast against the background. The average intensity value of background
is 0.5, while that of objects is either 0.2 or 0.8. The same parameters for
noise and weight matrices as in Fig. 6 are used. The first row shows four
objects ordered in depth layers. The second row shows four objects in a looped
depth configuration. Repulsion has no transitivity, so object pairs 1 and 3, 2
and 4 tend to be grouped together in the phase plane. The magnitudes indicate
the reliability of phase angle estimation. The comparison of the two rows also
shows the influence of local depth cues on global depth configuration.

5 Summary

In this paper, we develop a computational method for grouping based on
symmetric and asymmetric relationships between pairs of data points. We
formulate the problem in a graph partitioning framework using two directed
graphs to encode attraction and repulsion measures. In this framework, directed
graphs capture the asymmetry of relationships and repulsion complements
attraction in measuring dissociation.

We generalize normalized cuts and associations criteria to such a pair of 
directed graphs. Our formulation leads to Rayleigh quotients of Hermitian ma- 
trices, where the imaginary part encodes directed relationships, and the real 
part encodes undirected relationships with positive numbers for attraction and 
negative numbers for repulsion. The optimal solutions in the continuous domain 
can thus be computed by eigendecomposition, with the ordered partitioning 
embedded in the phases of eigenvectors: the angle separation determines the 
partitioning, while the relative phase advance indicates the ordering. 

We illustrate our method in image segmentation. We show that surface cues 
and depth cues can be treated equally in one framework and thus segmentation 
and figure-ground segregation can be obtained in one computational step. 
Abstract. Many successful studies of image segmentation employ the Markov
Random Field (MRF) model. However, most of them address the two-dimensional,
or spatial, MRF, and very few address a three-dimensional MRF model.
Generally, 'three-dimensional' has two meanings: spatially three-dimensional
and spatio-temporal. In this paper, we are especially interested in the
segmentation of spatio-temporal images, which appears to be equivalent to
the problem of tracking moving objects such as vehicles. For that purpose,
by extending the usual two-dimensional MRF, we defined a dedicated
three-dimensional MRF, which we call the Spatio-Temporal MRF model (S-T MRF).
This S-T MRF models a tracking problem by determining labels of groups of
pixels by referring to their texture and labeling correlations along the
temporal axis as well as the x-y image axes. Although vehicles severely
occlude each other in general traffic images, segmentation boundaries of
vehicle regions are determined precisely by this S-T MRF, which optimizes
such boundaries through spatio-temporal images. Consequently, the algorithm
achieved a 95% tracking success rate in middle-angle images at an
intersection and 91% in low-angle and front-view images at a highway junction.



1 Introduction 

Today, one of the most important research efforts in ITS has been the
development of systems that automatically monitor traffic flows. Rather than
the current practice of performing a global flow analysis, automated monitoring
systems should be based on local analysis of the behavior of each vehicle
within the global flow. The systems should be able to identify each vehicle and
track its behavior, and to recognize dangerous situations or events that might
result from a chain of such behavior. Tracking in complicated traffic has often
been impeded by the occlusion that occurs among vehicles in crowded situations.

Tracking algorithms have a long history in computer vision research. In
particular, in ITS areas, vehicle tracking, one of the specialized tracking
paradigms, has been extensively investigated. Peterfreund [1] employs the
'Snakes' [2] method to extract contours of vehicles for tracking purposes.
Smith [3] and Grimson [4] employ optical-flow analysis. In particular, Grimson
applies clustering and vector quantization to estimated flows. Leuck [5] and
Gardner [6] assume 3D models of vehicle shapes, and estimated vehicle images
are projected onto a 2D image plane according to appearance angle. The methods
of Leuck [5] and Gardner [6] require that many 3D models of vehicles be applied
to general traffic images. While these methods are effective in less crowded
situations, most of them cannot track vehicles reliably in situations that are
complicated by occlusion and clutter.

After some consideration, we have come to the idea that the tracking problem
under occlusion is equivalent to the segmentation of spatio-temporal images.
Many successful research efforts in the field of Computer Vision have employed
the Markov Random Field (MRF) model, including image restoration, image
segmentation, and image compression. The most fundamental work was done by
Geman and Geman [9], which has become the basis of all research on MRF, not
only for image restoration. Chellappa, Chatterjee, and Bagdazian [10] applied
MRF to image compression. Here, we are most interested in image segmentation by
MRF models. Methods of merging small segments into large segments can be seen
in the works of Panjwani and G. Healey [11][12], and these are kinds of
unsupervised segmentation. Further successful work on unsupervised segmentation
was done by Manjunath and R. Chellappa [13], R. Hu and M.N. Fahmy [14],
F. Cohen and Z. Fan [15], P. Andrey and P. Tarroux [16], S. Barker and
P. Rayner [17], and P. Rostaing, J.N. Provost and C. Collet [18]. Although all
of those studies were successful, they applied the 2D-MRF model to spatial
images such as static images. We therefore extend the 2D-MRF model to a
Spatio-Temporal MRF model, which is able to optimize not only the spatial
distribution but also the distribution along the temporal axis. An image
sequence has correlations at each pixel between consecutive images along the
time axis, and our model also considers this time-axis correlation. We named
the extended MRF the Spatio-Temporal Markov Random Field Model (S-T MRF model).

In order to resolve the segmentation problem of spatio-temporal images, we
developed the Spatio-Temporal Markov Random Field model [21], and the algorithm
has been applied successfully to traffic event analyses [20]. In this paper,
the primary idea of the Spatio-Temporal MRF model, which is defined in the
previous paper [21], is briefly described in Section 3. Then, improving on the
primary model, fine optimizations that enable more precise segmentation of
vehicle regions against more severe occlusions are described in Section 4.

2 Basic Ideas of Spatio-Temporal MRF Model 

2.1 Connecting Consecutive Images with Motion Vectors 

Strictly speaking, S-T MRF should segment spatio-temporal images pixel by
pixel. This means that a cubic clique of twenty-seven pixels should be
considered for either labeling or intensity correlations in S-T MRF, instead of
the square clique of nine pixels defined in the usual 2D-MRF [9]. However, this
condition will work only in the case where objects move as slowly as about one
pixel between consecutive images, so that there are labeling and intensity
correlations within such a cubic clique. Since practical moving images cannot
be captured at such high frame rates, the pixels that make up vehicles
typically move several tens of pixels between consecutive image frames.
Therefore, neighbor pixels within a cubic clique will never have correlations
of either intensities or labeling. Consequently, we defined our S-T MRF to
divide an image into blocks, each a group of pixels, and to optimize the
labeling of such blocks by connecting blocks between consecutive images with
reference to their motion vectors. Since this algorithm was defined so as to be
applicable to gray-scale images, all the experiments in this paper were
performed using only gray-scale images.




Fig. 1. Segmentation of Spatio-Temporal 
Images 



2.2 Initialization of Object-Map 




Fig. 2. Initializing an Object-Map 



In preparation for this algorithm, an image which consists of 640x480 pixels
is divided into 80x60 blocks, where each block consists of 8x8 pixels. Then,
one block is considered to be a site in the S-T MRF, and this S-T MRF
classifies each block into one of the vehicle regions or, equivalently, assigns
one vehicle label to each block. Here, only the blocks that have different
textures from the background image will be labeled as one of the vehicle
regions; the blocks that have textures similar to the background image will
never be labeled into any vehicle region. Such a distribution of classified
labels on blocks is referred to as an Object-Map.

Since the S-T MRF converges rapidly to a stable condition when it has a good
initial estimation, we use a deductive algorithm to determine an initial label
distribution [21]. In this deductive algorithm, a motion vector of each block
is estimated by a simple block matching technique, and then a representative
motion vector of a cluster is determined as the most frequent motion vector
among all of its constituent blocks (Figure 2(a)). In determining the initial
state of the Object-Map, all the blocks of a cluster are translated into the
next Object-Map by referring to this representative motion vector, which can be
approximately considered as the motion vector of each constituent block
(Figure 2(b)). After the translation, if neighbor blocks of the translated
blocks have a different texture from the background image, they are labeled as
part of the cluster. Although a cluster may at first include two or three
vehicles, they will be divided after a while by background images.
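The choice of a representative motion vector amounts to taking the mode of the
per-block vectors. The sketch below is illustrative only: the function names,
the use of pixel coordinates for block origins, and the sample vectors are
assumptions, and the block matching step that produces the per-block vectors
is not shown.

```python
from collections import Counter

def representative_motion_vector(block_vectors):
    """Most frequent motion vector among the blocks of one cluster."""
    counts = Counter(block_vectors)
    vector, _count = counts.most_common(1)[0]
    return vector

def translate_blocks(block_origins, vector):
    """Shift the pixel origin of every block by the same motion vector."""
    d_row, d_col = vector
    return [(row + d_row, col + d_col) for row, col in block_origins]

# Hypothetical per-block motion vectors (rows, cols) from block matching.
cluster_vectors = [(0, 6), (0, 6), (1, 6), (0, 6), (0, 5)]
rep = representative_motion_vector(cluster_vectors)

# Hypothetical top-left pixel coordinates of three blocks of the cluster.
origins = [(16, 24), (16, 32), (24, 24)]
moved = translate_blocks(origins, rep)
print(rep, moved)
```

Using the mode rather than the mean keeps a few badly matched blocks from
dragging the whole cluster off course.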

2.3 Optimization of Object-Map 

However, when vehicles occlude each other, some blocks in the occluded region
will be labeled as both of the two vehicles. Such ambiguous labeling will be
optimized by the Spatio-Temporal MRF model. In Section 3, the primary idea of
this Spatio-Temporal MRF is described in detail. Since the S-T MRF model in
Section 3 approximates the motion vector of each constituent block by the
representative motion vector of the cluster, and since it optimizes
segmentations by referring to correlations only between consecutive images, we
call this algorithm the 'primary S-T MRF model'.

In Section 4, this primary S-T MRF model is improved to optimize segmentations
globally through spatio-temporal images, in order to segment vehicle regions
precisely even in the most severe occlusion situations, as in low-angle and
front-view images.

3 Primary Spatio-Temporal MRF model 

3.1 Details of primary S-T MRF model 

Some blocks may be classified as having multiple vehicle labels due to
occlusion and fragmentation. We can resolve this ambiguity by employing
stochastic relaxation with the Spatio-Temporal Markov Random Field (MRF) model.
Our Spatio-Temporal MRF estimates a current Object-map (a distribution of
vehicle labels) according to a previous object map, and previous and current
images. The notation is as follows:

- $G(t-1) = g$, $G(t) = h$: An image $G$ at time $t-1$ has a value $g$, and
  $G$ at time $t$ has a value $h$. At each pixel, this condition is described
  as $G(t-1; i, j) = g(i, j)$, $G(t; i, j) = h(i, j)$.

- $X(t-1) = x$, $X(t) = y$: An object map $X$ at time $t-1$ is estimated to
  have a label distribution $x$, and $X$ at time $t$ is estimated to have a
  label distribution $y$. At each block, this condition is described as
  $X_k(t-1) = x_k$, $X_k(t) = y_k$, where $k$ is a block number.

We will determine the most likely $X(t) = y$ so as to have the MAP (Maximum A
posteriori Probability) for given $G(t-1) = g$, $G(t) = h$ and $X(t-1) = x$,
that is, for the given previous and current images and the previous object map.
The a posteriori probability can be described using the Bayesian equation:

$$P(X(t)=y \mid G(t-1)=g, X(t-1)=x, G(t)=h) = \frac{P(G(t-1)=g, X(t-1)=x, G(t)=h \mid X(t)=y)\,P(X(t)=y)}{P(G(t-1)=g, X(t-1)=x, G(t)=h)} \quad (1)$$

$P(G(t-1)=g, X(t-1)=x, G(t)=h)$, the probability of having the previous and
current images and the previous object map, can be considered as a constant.
Consequently, maximizing the a posteriori probability is equal to maximizing
$P(G(t-1)=g, X(t-1)=x, G(t)=h \mid X(t)=y)\,P(X(t)=y)$.

$P(X(t) = y)$ is the probability that each block $C_k$ has $X_k(t) = y_k$ (for
all $k$). Here, $y_k$ is a vehicle label. For each $C_k$, we can consider its
probability as a Boltzmann distribution. Then, $P(X(t) = y)$ is a product of
these Boltzmann distributions:

$$P(X(t)=y) = \prod_k \exp[-U_N(N_{y_k})]/Z_{N_k} = \prod_k \exp\left[-\frac{1}{2\sigma_{N_y}^2}(N_{y_k} - \mu_{N_y})^2\right]/Z_{N_k} \quad (2)$$

Here, $N_{y_k}$ is the number of neighbor blocks of a block $C_k$ (Figure 3)
that belong to the same vehicle as $C_k$. Namely, the more neighbor blocks that
have the same vehicle label, the more likely the block is to have the vehicle
label. Currently, we consider eight neighbors as shown in Figure 3. Thus,
$\mu_{N_y} = 8$, because the probability related to block $C_k$ has its maximum
value when block $C_k$ and all its neighbors have the same vehicle label.
Therefore, the energy function $U_N(N_{y_k})$ takes a minimum value at
$N_{y_k} = 8$ and a maximum value at $N_{y_k} = 0$.

Fig. 3. 8 neighbor blocks (shaded: Object ID is same as $C_k$)
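Counting same-label neighbors is a small grid operation. The following sketch
is an illustration under assumptions that the text does not state: blocks
outside the map are simply not counted, the coefficient `a` stands in for the
unspecified scale of the quadratic energy, and the object map is made up.

```python
def same_label_neighbors(object_map, row, col):
    """Number of the 8 neighbors of block (row, col) sharing its label."""
    label = object_map[row][col]
    rows, cols = len(object_map), len(object_map[0])
    count = 0
    for d_row in (-1, 0, 1):
        for d_col in (-1, 0, 1):
            if d_row == 0 and d_col == 0:
                continue  # skip the block itself
            r, c = row + d_row, col + d_col
            if 0 <= r < rows and 0 <= c < cols and object_map[r][c] == label:
                count += 1
    return count

def neighbor_energy(n_same, a=1.0, mu_n=8):
    """Quadratic energy, minimal when all eight neighbors agree."""
    return a * (n_same - mu_n) ** 2

# Made-up object map: 0 is background, 1 and 2 are vehicle labels.
object_map = [[1, 1, 1, 0],
              [1, 1, 1, 0],
              [1, 1, 2, 2],
              [0, 0, 2, 2]]

n_inside = same_label_neighbors(object_map, 1, 1)
n_border = same_label_neighbors(object_map, 2, 2)
print(n_inside, neighbor_energy(n_inside))
print(n_border, neighbor_energy(n_border))
```

A block deep inside a vehicle region gets a low energy, while a block on the
boundary between two vehicles is penalized, which is what favors compact
labelings.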

We also consider the probability of $G(t-1) = g$, $G(t) = h$, $X(t-1) = x$ for
a given object map $X(t) = y$ as a Boltzmann function of two independent
variables:

$$P(G(t-1)=g, X(t-1)=x, G(t)=h \mid X(t)=y) = \prod_k \exp[-U_{pre}(M_{xy_k}, D_{xy_k})]/Z_{DM_k}$$

$$= \prod_k \exp[-U_M(M_{xy_k})]/Z_{M_k} \cdot \prod_k \exp[-U_D(D_{xy_k})]/Z_{D_k}$$

$$= \prod_k \exp\left[-\frac{1}{2\sigma_{M_{xy}}^2}(M_{xy_k} - \mu_{M_{xy}})^2\right]/Z_{M_k} \cdot \prod_k \exp\left[-\frac{1}{2\sigma_{D_{xy}}^2}(D_{xy_k} - \mu_{D_{xy}})^2\right]/Z_{D_k} \quad (3)$$

$M_{xy_k}$ is a goodness measure of the previous object map $X(t-1) = x$ under
a given current object map $X(t) = y$. Let us assume that a block $C_k$ has a
vehicle label $O_m$ in the current object map $X(t)$, and $C_k$ is shifted
backward by the estimated motion vector, $-V_{O_m} = (-v_{mi}, -v_{mj})$ of the
vehicle $O_m$, in the previous image (Figure 4). Then the degree of overlapping
is estimated as $M_{xy_k}$, the number of overlapping pixels of the blocks with
the same vehicle labels. The more pixels that have the same vehicle label, the
more likely a block $C_k$ belongs to the vehicle. The maximum number is
$\mu_{M_{xy}} = 64$, and the energy function $U_M(M_{xy_k})$ takes a minimum
value at $M_{xy_k} = 64$ and a maximum value at $M_{xy_k} = 0$.

Fig. 4. Neighbor condition between Consecutive Images (legend: Object ID was
estimated as $O_m$; Object ID is assumed as $O_m$)

Fig. 5. Texture Matching
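The overlap count can be written as a loop over the 64 pixels of a block. The
sketch below is a hedged illustration: the function name, the row and column
convention for the motion vector, the treatment of pixels that fall outside
the image (not counted), and the small object map are all assumptions.

```python
BLOCK = 8  # block side length in pixels

def overlap_pixels(prev_map, block_row, block_col, label, motion):
    """Pixels of a block that, shifted back by `motion`, land on `label`."""
    v_row, v_col = motion
    rows, cols = len(prev_map), len(prev_map[0])
    count = 0
    for d_row in range(BLOCK):
        for d_col in range(BLOCK):
            r = block_row * BLOCK + d_row - v_row  # pixel row in previous image
            c = block_col * BLOCK + d_col - v_col  # pixel column in previous image
            if 0 <= r < rows * BLOCK and 0 <= c < cols * BLOCK:
                if prev_map[r // BLOCK][c // BLOCK] == label:
                    count += 1
    return count

# Made-up previous object map (3x3 blocks): label 1 fills the lower right.
prev_map = [[0, 0, 0],
            [0, 1, 1],
            [0, 1, 1]]

full = overlap_pixels(prev_map, 1, 2, label=1, motion=(0, 4))
half = overlap_pixels(prev_map, 1, 2, label=1, motion=(0, 12))
none = overlap_pixels(prev_map, 0, 0, label=1, motion=(0, 0))
print(full, half, none)
```

A correct motion hypothesis brings the whole block back onto its own vehicle
(64 pixels), whereas a wrong one lands partly or entirely on other labels.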

For example, when deciding to which of the vehicles $O_1$ and $O_2$ a block
belongs, $U_M(M_{xy_k})$ is estimated as follows. First, assuming that the
block belongs to $O_1$, the energy function is estimated as $U_M(M_{xy_k}) =
U_{M1}$ by referring to $-V_{O_1} = (-v_{1i}, -v_{1j})$. Then, assuming that
the block belongs to $O_2$, the energy function is estimated as $U_M(M_{xy_k})
= U_{M2}$ by referring to $-V_{O_2} = (-v_{2i}, -v_{2j})$. As a result of these
estimations, when $U_{M1}$ is less than $U_{M2}$, this block more likely
belongs to vehicle $O_1$.

$D_{xy_k}$ represents texture correlation between $G(t-1)$ and $G(t)$. Let us
suppose that $C_k$ is shifted backward in the image $G(t-1)$ according to the
estimated motion vector $-V_{O_m} = (-v_{mi}, -v_{mj})$. The texture
correlation at the block $C_k$ is evaluated as (see Figure 5):

$$D_{xy_k} = \sum_{0 \le di < 8,\; 0 \le dj < 8} |G(t; i+di, j+dj) - G(t-1; i+di-v_{mi}, j+dj-v_{mj})| \quad (4)$$

The energy function $U_D(D_{xy_k})$ takes a minimum value at $D_{xy_k} = 0$.
The smaller $D_{xy_k}$ is, the more likely $C_k$ belongs to the vehicle. That
is, the smaller $U_D(D_{xy_k})$ is, the more likely $C_k$ belongs to the
vehicle. For example, when deciding to which of the vehicles $O_1$ and $O_2$ a
block belongs, $U_D(D_{xy_k})$ is estimated as follows. First, assuming that
the block belongs to $O_1$, the energy function is estimated as $U_D(D_{xy_k})
= U_{D1}$ by referring to $V_{O_1} = (v_{1i}, v_{1j})$. Then, assuming that the
block belongs to $O_2$, the energy function is estimated as $U_D(D_{xy_k}) =
U_{D2}$ by referring to $V_{O_2} = (v_{2i}, v_{2j})$. As a result of these
estimations, when $U_{D1}$ is less than $U_{D2}$, this block most likely
belongs to vehicle $O_1$.
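The texture measure is a sum of absolute differences between a block and its
motion-compensated counterpart in the previous frame. The sketch below is
illustrative: images are plain lists of rows, the function performs no bounds
checking (the caller is assumed to pass a block and motion vector that stay
inside both images), and the synthetic frames are made up.

```python
BLOCK = 8  # block side length in pixels

def texture_difference(curr, prev, top, left, motion):
    """Sum of absolute differences between a block and its shifted-back match."""
    v_row, v_col = motion
    total = 0
    for d_row in range(BLOCK):
        for d_col in range(BLOCK):
            now = curr[top + d_row][left + d_col]
            before = prev[top + d_row - v_row][left + d_col - v_col]
            total += abs(now - before)
    return total

# Synthetic 16x16 previous frame with a smooth intensity pattern.
SIZE = 16
prev = [[(7 * r + 3 * c) % 256 for c in range(SIZE)] for r in range(SIZE)]
# Current frame: the same pattern moved two pixels to the right.
curr = [[prev[r][c - 2] if c >= 2 else 0 for c in range(SIZE)]
        for r in range(SIZE)]

matched = texture_difference(curr, prev, top=4, left=4, motion=(0, 2))
unmatched = texture_difference(curr, prev, top=4, left=4, motion=(0, 0))
print(matched, unmatched)
```

The correct motion hypothesis gives a difference of zero, while the wrong one
accumulates a residual at every pixel, so comparing the two values tells which
vehicle's motion better explains the block.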
Consequently, this optimization problem results in a problem of determining
a map $X(t) = y$ which minimizes the following energy function:

$$U(y_k) = U_N(N_{y_k}) + U_{pre}(D_{xy_k}, M_{xy_k}) = U_N(N_{y_k}) + U_D(D_{xy_k}) + U_M(M_{xy_k})$$

$$= a(N_{y_k} - \mu_{N_y})^2 + b(M_{xy_k} - \mu_{M_{xy}})^2 + c\,D_{xy_k}^2 \quad (5)$$

$U(y_k)$ is considered to be the energy function for the Spatio-Temporal MRF,
and $U(y_k)$ will be minimized by the relaxation process.
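For a single block, comparing candidate labels reduces to evaluating the
three-term energy for each hypothesis and keeping the smallest. The sketch
below covers only that per-block comparison, not the relaxation process itself;
the weights `a`, `b`, `c` default to 1 purely for illustration, and the
measurements for the two candidate vehicles are made up.

```python
def block_energy(n_same, m_overlap, d_texture,
                 a=1.0, b=1.0, c=1.0, mu_n=8, mu_m=64):
    """Energy of one block: neighbor, overlap and texture terms."""
    return (a * (n_same - mu_n) ** 2
            + b * (m_overlap - mu_m) ** 2
            + c * d_texture ** 2)

def best_label(candidates, **weights):
    """Label whose (N, M, D) measurements give the lowest energy."""
    return min(candidates,
               key=lambda label: block_energy(*candidates[label], **weights))

# Hypothetical measurements (N, M, D) for one ambiguous block.
candidates = {
    "O1": (7, 60, 10),
    "O2": (3, 20, 45),
}

for label, measurements in candidates.items():
    print(label, block_energy(*measurements))
print(best_label(candidates))
```

In practice the three terms live on very different scales, so the weights
would need tuning; a full relaxation would also revisit blocks repeatedly,
since changing one label alters the neighbor counts of the blocks around it.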



3.2 Experimental Results 




frame 678, frame 696, frame 702, frame 710
(a) Tracking Images    (b) Object-Maps

Fig. 6. Tracking results by S-T MRF

Figure 6 shows a sequence of tracking two vehicles that caused an occlusion
situation. These images were obtained at a rate of 10 frames/second, and a
frame number is attached to each image. Although a car is partly occluded
behind a truck, the two vehicles have been successfully segmented.

We applied the tracking algorithm utilizing the Spatio-Temporal MRF model to
25 minutes of traffic images at the intersection. A total of 3,214 vehicles
traversed the intersection; of these, 541 were occluded. As a result, the
method was able to track separated vehicles that did not cause occlusions at a
success rate of over 99%, and was able to segment and track the 541 occluded
vehicles at a success rate of about 95%.

4 Global Optimization through Accumulated Images 

Unfortunately, most of the images captured by cameras on infrastructure are
low-angle images, and many of them are front-view images. Therefore, in order to
construct a traffic monitoring system that is of practical use in various
situations, it is necessary to apply this Spatio-Temporal MRF model to such
low-angle and front-view images. However, some characteristics of such images
impede successful tracking. Firstly, more severe occlusions occur in low-angle
images than in the middle-angle images used in Section 3. Since occlusion
situations occupy a long period in front-view images, vehicles cannot be divided
until they are just under the camera. Secondly, in front-view images, many of
the vehicles that occlude one another move with almost equal motion vectors.
These similarities in motion vectors cause ambiguous boundaries among occluded
vehicles.

In order to resolve these problems, it is necessary to apply global
optimizations through accumulated images as well as between consecutive images.
Following the principle that tracking problems correspond to segmentation of
spatio-temporal images, this improved algorithm will also be applied to the
middle-angle images described in Section 3.

4.1 Applying S-T MRF Backward along Temporal Axis 

In order to resolve the first problem, it is effective to apply the S-T MRF
model backward along the temporal axis; we call this procedure 'reversed S-T
MRF'. Since the spatio-temporal images are symmetrically arranged along the
temporal axis, this reversed S-T MRF model is able to divide each vehicle
backward through the previous images. In practice, about fifty images along with
their corresponding Object-maps are accumulated; the S-T MRF model is applied to
these accumulated spatio-temporal images backward through the previous images,
re-mapping the Object-maps.
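
The accumulation and backward pass described above can be sketched as follows. This is only an outline under stated assumptions: `make_buffer`, `reversed_pass`, and the `refine` callback are hypothetical names, and `refine` stands in for one application of the S-T MRF relaxation from a newer frame to the frame before it.

```python
from collections import deque

WINDOW = 50  # "about fifty images" are accumulated, per the text


def make_buffer(window=WINDOW):
    """Fixed-size buffer of (frame, object_map) pairs; oldest entries drop out."""
    return deque(maxlen=window)


def reversed_pass(buffer, refine):
    """Re-map Object-maps from the newest frame back to the oldest.

    refine(newer_frame, older_frame, newer_map, older_map) is assumed to
    return the re-mapped Object-map for the older frame.
    """
    items = list(buffer)
    maps = [object_map for _, object_map in items]
    for t in range(len(items) - 1, 0, -1):
        newer_frame, older_frame = items[t][0], items[t - 1][0]
        maps[t - 1] = refine(newer_frame, older_frame, maps[t], maps[t - 1])
    return maps
```

Because each step uses the already re-mapped newer Object-map, a separation found in the newest frame can propagate back through the whole window.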

When a cluster is split by the background image, the S-T MRF model is applied
backward along the temporal axis. Since this reversed S-T MRF can divide the
segments of occluded vehicles individually backward through the previous images,
we are able to determine the precise behavior of such vehicles backward along
the temporal axis. Without this process, about half of the vehicles in low-angle
and front-view images are not divided successfully until they arrive just in
front of the camera.

4.2 Optimization of Motion Vectors

Fig. 7. Optimization of motion vectors: (a) failure by primary S-T MRF; (b) optimizing respective motion vectors



Using the primary algorithm of S-T MRF defined in Section 3, blocks belonging
to a cluster are re-mapped into the next Object-map by referring to a
representative motion vector estimated for the cluster. However, when a cluster
includes two or three vehicles, this re-mapping method may cause a failure in
tracking. For example, as shown at image frame t-1 in Figure 7(a), two vehicles
were included in Cluster-1, where Vehicle-1 is occluded behind Vehicle-2, and
Cluster-2 was very close to Cluster-1. V-2 then gradually occludes V-3 while
separating from V-1 at frame t+1. Finally, V-2 completely separates from V-1 at
frame t+2. Using only the primary algorithm of S-T MRF, the algorithm may not be
able to recognize that Cluster-1 includes two vehicles.

Such a failure occurs when the representative motion vector of a cluster
including V-1 and V-2 is determined to be similar to the representative motion
vector of V-1 but dissimilar to that of V-2. With only the primary S-T MRF,
blocks belonging to V-2 are then re-mapped by referring to a cluster motion
vector that resembles the motion vector of V-1. As a result, blocks included in
Vehicle-2 gradually penetrate into Cluster-2. Unfortunately, this kind of
failure occurs frequently in low-angle and front-view images.

Therefore, we have come to the conclusion that each block should be re-mapped
by referring not to a representative motion vector of a cluster but rather to
the motion vector characteristic of each block included in the cluster
(Figure 7(b)). Here, we call the group of such per-block motion vectors the
respective motion vectors. By re-mapping each block with reference to its
respective motion vector, the segment boundary of Cluster-1 extends
appropriately as V-2 moves apart from V-1; V-1 and V-2 are then divided by
background images (Figure 7(b)).

However, since block matching alone cannot be expected to estimate the motion
vectors of all blocks appropriately because of their poor textures, failures
would occur in determining appropriate boundaries among clusters. Therefore, in
order to optimize such boundaries, it appears to be effective to optimize the
motion vector of each block by referring to the motion vectors of its neighbor
blocks. This condition can be defined as the energy function in the following
equation (6).

$$P(C_{k_1}, \dots, C_{k_n}) = \exp[-U_{mv}(C_k)] / Z_{mv}; \qquad U_{mv}(C_k) = \sum_{B_k} \frac{|V_{C_k} - V_{B_k}|^2}{N_{xy}} \qquad (6)$$

Here, $V_{C_k}$ and $V_{B_k}$ represent the estimated motion vectors of the block
$C_k$ and of its neighbor blocks $B_k$, respectively. The summation is taken over
the blocks $B_k$ that have the same label as $C_k$ (see Figure 3). This energy
function (6) expresses that neighbor blocks should have motion vectors similar
to one another.
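
A minimal sketch of such a smoothness energy is given below. It assumes that the energy is the squared difference between a block's motion vector and those of its same-label neighbors, averaged over the number of those neighbors; the function name `u_mv` and the `(same_label, vector)` input format are hypothetical choices for this example.

```python
def u_mv(v_block, neighbors):
    """Motion-vector smoothness energy for one block.

    v_block   -- (vx, vy) of the block C_k
    neighbors -- list of (same_label, (vx, vy)) for the neighbor blocks B_k;
                 only neighbors sharing the block's label contribute
    """
    same = [v for same_label, v in neighbors if same_label]
    if not same:
        return 0.0  # no same-label neighbors: nothing to compare against
    total = sum((v_block[0] - vx) ** 2 + (v_block[1] - vy) ** 2
                for vx, vy in same)
    return total / len(same)
```

A block whose vector agrees with its same-label neighbors gets zero energy, while an outlier vector (for example, a mismatch caused by poor texture) is penalized quadratically.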

In this algorithm, each block is re-mapped into the next Object-map by
referring to the motion vector of the block itself instead of the
representative motion vector of the cluster. Consequently, in determining a
label out of the alternative labels, this algorithm minimizes the following
energy function:

$$U(y_k(t)) + f\,U_{mv}(C_k(t-1)) = a(N_{y_k} - \mu_{N_y})^2 + b(M_{xy_k} - \mu_{M_{xy}})^2 + c D_{xy_k}^2 + f \sum_{B_k} \frac{|V_{C_k} - V_{B_k}|^2}{N_{xy}} \qquad (7)$$

Here $U(y_k)$ is defined as in (5) at $\tau = t$; the energy terms
$U_M(M_{xy_k})$ and $U_D(D_{xy_k})$ are evaluated by referring to the respective
motion vectors of the blocks belonging to the cluster. $U_{mv}(C_k(t-1))$ is
estimated using motion vectors at $\tau = t-1$; $C_k(t-1)$ represents the
original block of $C_k(t)$, and $N_{xy}$ represents the number of neighbor blocks
that have the same label as $C_k(t-1)$.

Therefore, the motion vectors of blocks at $\tau = t-1$ and the Object-map at
$\tau = t$ are optimized simultaneously by considering both similarities in
motion vectors among neighbor blocks and texture correlations between
consecutive images.



4.3 Merging Fragmental Segments 

In spite of the optimization process in the previous subsection, boundaries are
sometimes still ambiguous due to similarities in motion vectors and poor
textures of the blocks that belong to different vehicles. Such ambiguous
boundaries cause immoderate (excessive) segmentation of a single vehicle.

For example, consider the case in which Vehicle-1 comes close enough to occlude
Vehicle-2, and the boundary between the vehicles becomes ambiguous, so that the
segment of Vehicle-1 includes some blocks which really belong to Vehicle-2.
When Vehicle-1 moves apart from Vehicle-2, the segment of Vehicle-1 is divided
into two segments by background images (Figure 8(a)). Although a fragmental
segment appears on Vehicle-2, it should really be included in the segment of
Vehicle-2. On the other hand, there are cases in which such segmentations are
correct (Figure 8(b)). That is the case where a cluster including Vehicle-1 and
Vehicle-3 comes to occlude Vehicle-2, and then Vehicle-1 moves apart from
Vehicle-3. In this case, there really should be three segments. Therefore,
immoderate segmentations such as that in Figure 8(a) cannot be identified a
priori.

Consequently, after the immoderate segmentations by reversed S-T MRF,
appropriate segment boundaries should be determined a posteriori by a segment
merging process. In order to quantitatively examine whether Vehicle-2 and
Vehicle-3 are likely to be merged into a single cluster, we define the following
functions:



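$$P = \exp[-(U_{connect} + U_{mv})] \qquad (8)$$

$$U_{connect} = \alpha N_{connect} + \beta N_{no\_connect} \qquad (9)$$

$$U_{mv} = \gamma \sum_{connected} \left[ \{(v_{xm} - v_{x2})^2 + (v_{ym} - v_{y2})^2\} + \{(v_{xm} - v_{x3})^2 + (v_{ym} - v_{y3})^2\} \right] \qquad (10)$$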

Here, $N_{connect}$ is the number of frames in which the segments of Vehicle-2
and Vehicle-3 are connected via blocks that differ from the background image,
and $N_{no\_connect}$ is the number of frames in which the segments of Vehicle-2
and Vehicle-3 are not connected, as in frame k in Figure 8(b).
$(v_{x2}, v_{y2})$ and $(v_{x3}, v_{y3})$ represent the motion vectors of
Vehicle-2 and Vehicle-3, respectively, and $(v_{xm}, v_{ym})$ represents the
motion vector of the cluster that is the merged segment of Vehicle-2 and
Vehicle-3. The summation is taken only over connected frames.

Using P as defined above, the likelihood of merging the segments of Vehicle-2
and Vehicle-3 into a single segment is written as $P_{merge} = P/(1+P)$. When the
total energy function $U_{connect} + U_{mv} = 0$, P becomes 1 and $P_{merge}$
becomes 0.5. That is, when $U_{connect} + U_{mv} = 0$, the likelihood and
unlikelihood of merging the segments are the same. The lower the total energy
$U_{connect} + U_{mv}$ becomes, the more the likelihood of merging the segments
increases. This likelihood function $P_{merge}$ is estimated between all pairs of
segments, and whether each pair should be merged into a single segment is then
determined with probability $P_{merge}$.
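
The merging decision can be sketched as below. The sketch assumes $P_{merge} = P/(1+P)$, which is one form consistent with the statement that P = 1 gives a likelihood of 0.5; the function names and the weights `alpha`, `beta`, and `gamma` passed by the caller are hypothetical and do not reproduce the authors' parameter values.

```python
import math
import random


def u_connect(n_connect, n_no_connect, alpha, beta):
    """Connection energy in the form of Eq. (9)."""
    return alpha * n_connect + beta * n_no_connect


def u_mv_merge(connected_frames, gamma):
    """Motion energy in the form of Eq. (10).

    connected_frames -- per connected frame, a tuple (v_m, v_2, v_3) of
                        (vx, vy) vectors for the merged cluster and both vehicles
    """
    total = 0.0
    for v_m, v_2, v_3 in connected_frames:
        total += (v_m[0] - v_2[0]) ** 2 + (v_m[1] - v_2[1]) ** 2
        total += (v_m[0] - v_3[0]) ** 2 + (v_m[1] - v_3[1]) ** 2
    return gamma * total


def p_merge(u_total):
    """Merge likelihood, assuming P_merge = P / (1 + P) with P = exp(-U)."""
    p = math.exp(-u_total)
    return p / (1.0 + p)


def should_merge(p, rng=random.random):
    """Stochastic decision: merge with probability p."""
    return rng() < p
```

With zero total energy the decision is a fair coin flip; negative energy (for instance, many connected frames under a negative `alpha`) tilts it toward merging.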





Fig. 8. Spatio-Temporal Object-Maps by reversed S-T MRF: (a) immoderate case; (b) correct case

4.4 Experimental Results

In order to verify the effectiveness of the global optimization algorithm, we
applied it to forty minutes of images of a junction on the Tokyo Metropolitan
Expressway. During these forty minutes, 2,381 vehicles passed this junction. In
this experiment, 54 frame images and 54 Object-maps were accumulated. Since
images are captured at a rate of 10 frames/second, 54 frames correspond to 5.4
seconds. Typically, vehicles run from the entrances to under the camera within
about 5 seconds. Therefore, by applying reversed S-T MRF, most of the vehicles
can be tracked from the entrances. Parameters were decided by trial and error
as: a = 1/2, b = 1/256, c = 32/1000000, f = 1/4, = 0.

First of all, Table 1 shows the optimization levels, which combine primary S-T
MRF, reversed S-T MRF, optimization of each motion vector, and segment merging.
Each level employs the optimization algorithms that are indicated by 'yes'. For
example, Level-2 employs primary S-T MRF, reversed S-T MRF, and optimization of
each motion vector; however, it does not employ segment merging. Here, though
only 37.8% of the 2,381 vehicles were tracked successfully by the Level-1
algorithm, the success rate was drastically improved up to 91.2% by the Level-3
algorithm.

Figure 9 shows successful tracking results obtained by applying the Level-2
algorithm, compared with Level-1. Figure 9(a) shows a tracking result by the
Level-1 algorithm, which employed reversed S-T MRF and did not employ
optimization of each motion vector. Here some vehicles were recognized as a
single cluster, as indicated by the vehicle ID numbers 9 and 56. Since the
boundaries of clusters are ambiguous without optimization of each motion vector,
the motion vector of each block can differ from the representative motion vector
of the cluster. As a result, such blocks were mistakenly determined as belonging
to a neighbor cluster. Therefore, evaluating the Object-maps of such blocks on
ambiguous boundaries by referring to their own motion vectors enables us to
follow such blocks more correctly forward along the temporal axis. Then, a
cluster including two or more vehicles can be divided correctly in a certain
image frame. As a result, such vehicles are divided into different segments
backward through the past frames by applying reversed S-T MRF (Figure 9(b)).

Figure 10 shows successful tracking results that exhibit the effectiveness of
the segment merging algorithm. Without segment merging, some fragmental segments
appeared, as shown in Figure 10(a). On the other hand, such fragmental segments
are merged by the segment merging algorithm into correct segments corresponding
to vehicles.





Table 1. Optimization Levels

                                     Level-0   Level-1   Level-2   Level-3
Primary S-T MRF                      yes       yes       yes       yes
Reversed S-T MRF                     no        yes       yes       yes
Optimization of each motion vector   no        no        yes       yes
Segment Merging                      no        no        no        yes

Fig. 9. Effects of optimization of each motion vector on Tracking results (frame 685): (a) by Level-1; (b) by Level-2

Fig. 10. Effects of Segment Merging on Tracking results: (a) by Level-2; (b) by Level-3

Fig. 11. Success Rates vs. Optimization Levels



Figure 11 shows the dependence of the tracking success rates on the algorithm
levels. The rates were examined using both the middle-angle images at the
intersection and the low-angle images at the expressway junction. In
middle-angle images, since most vehicles could be recognized vehicle by vehicle,
the success rate did not decrease even with the Level-1 algorithm. On the other
hand, in low-angle and front-view images, serious occlusions drastically
decreased the success rate to less than 40%.



Finally, Figure 12 shows the dependence of the tracking success rates on frame
rates. These were also examined using both the middle-angle images and the
low-angle images. So far, all of the experiments were performed using images
acquired at a rate of 10 frames/second. However, as described in subsection 2.1,
one of the essential ideas of the Spatio-Temporal Markov Random Field model is
linking images and Object-maps between consecutive frames by motion vectors.
Therefore, it is very interesting to examine how the frame rates affect the
success rates in tracking results. In this experiment, images were captured at a
rate of 30 frames/second. In Figure 12, frame rates are described as [30/x]
frames/second using the numbers x on the x-axis. Therefore, '1' on the x-axis
corresponds to a rate of 30 frames/second, and '3' corresponds to 10
frames/second. In the experiment of Figure 12, the Level-3 tracking algorithm
was used; twenty minutes of images were examined.
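
The relation between the x-axis value and the effective frame rate can be illustrated with a short sketch. It assumes that lower frame rates are obtained by keeping every x-th frame of the 30 frames/second sequence, which the text does not state explicitly; the function names are hypothetical.

```python
def frame_rate(x, base=30.0):
    """Effective frame rate in frames/second for x-axis value x."""
    return base / x


def subsample(frames, x):
    """Keep every x-th frame of a sequence captured at the base rate."""
    return frames[::x]
```

For example, `frame_rate(3)` gives 10 frames/second, the rate used in the earlier experiments, and `frame_rate(10)` gives 3 frames/second.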




Fig. 12. Success Rates vs. Optimization Frame Rates (x-axis: frame rate, [30/x] frames/sec, for x = 0 to 11)



As shown in this figure, the success rates decreased steeply at 3 frames/second
in both kinds of images. It seems that the block matching algorithm used to
obtain motion vectors did not work well for images at frame rates as low as 3
frames/second, because the search region for matching becomes too broad to find
the most likely matched region. In Figure 12, the success rate for middle-angle
images does not decrease as steeply as that for low-angle images. However, this
should not be taken to suggest that middle-angle images are less sensitive to
frame rates than low-angle images. At the intersection in the middle-angle
images, vehicles move more slowly than on the expressway in the low-angle images
because of crowded traffic, pedestrians, etc.



5 Conclusions 

For many years, severe occlusions have impeded automated analysis and
monitoring of traffic images. After some consideration, we have come to the
conclusion that such a tracking problem is equivalent to a segmentation problem
of spatio-temporal images. For that purpose, we defined the Spatio-Temporal
Markov Random Field model, which optimizes segmentations through spatio-temporal
images by referring to texture and labeling correlations along the temporal axis
as well as the x-y axes. In this model, the motion vectors of the constituent
blocks are essential for connecting discretized parts of a region along the
temporal axis into a single region.

Traffic images at an intersection and at a highway junction were then examined
in order to evaluate the reliability of this S-T MRF model. As a result, 95% of
vehicles were tracked successfully in middle-angle images at the intersection,
and 91% of vehicles were tracked successfully in low-angle and front-view images
at the highway junction. Although severe occlusions occur frequently in
low-angle and front-view images, vehicle regions were segmented precisely by the
global optimizations of the S-T MRF model.

Finally, by using such a reliable tracking algorithm, it will be possible to
analyze and monitor traffic events precisely and in detail, even in complicated
situations such as at large intersections or highway junctions.

6 Acknowledgment 

We would like to express our gratitude to the Japan Society of Traffic
Engineers, who provided us with images of the expressway junction.



References 

1. Natan Peterfreund, “Robust Tracking of Position and Velocity With Kalman
Snakes”, IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), Vol.21,
No.6, 1999, pp.564-569.

2. M. Kass, A. Witkin, and D. Terzopoulos, “Snakes: Active contour models”, Int’l
J. Computer Vision, Vol.1, 1988, pp.321-331.

3. S.M. Smith and J.M. Brady, “ASSET-2: Real-Time Motion Segmentation and Shape
Tracking”, IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), Vol.17,
No.8, 1995, pp.814-820.

4. C. Stauffer and W.E.L. Grimson, “Adaptive background mixture models for
real-time tracking”, Proc. of CVPR 1999, Jun 1999, pp.246-252.

5. Holger Leuck and Hans-Hellmut Nagel, “Automatic Differentiation Facilitates
OF-Integration into Steering-Angle-Based Road Vehicle Tracking”, Proc. of
Computer Vision and Pattern Recognition (CVPR) ’99, pp.360-365.







6. Warren F. Gardner and Daryl T. Lawton, “Interactive Model-Based Vehicle
Tracking”, IEEE Trans. PAMI, Vol.18, No.11, 1996, pp.1115-1121.

7. N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller and E. Teller,
“Equations of State calculations by fast computing machines”, J. Chem. Phys.,
Vol.21, pp.1087-1091, 1953.

8. S. Kirkpatrick, C.D. Gelatt and M.P. Vecchi, “Optimization by Simulated
Annealing”, Science, 220, pp.671-680, 1983.

9. S. Geman and D. Geman, “Stochastic Relaxation, Gibbs Distribution, and the
Bayesian Restoration of images”, IEEE Trans. PAMI, Vol.6, No.6, pp.721-741, 1984.

10. R. Chellappa, S. Chatterjee and R. Bagdazian, “Texture Synthesis and
Compression Using Gaussian-Markov Random Field Models”, IEEE Trans. SMC, Vol.15,
No.2, 1985.

11. D.K. Panjwani and G. Healey, “Markov random field models for unsupervised
segmentation of textured color images”, IEEE Trans. PAMI, Vol.17, No.10,
pp.939-954, 1995.

12. D.K. Panjwani and G. Healey, “Selecting neighbors in random field models for
color images”, Proc. ICIP, Vol.II, pp.56-60, 1994.

13. B.S. Manjunath and R. Chellappa, “Unsupervised Texture Segmentation Using
Markov Random Field Models”, IEEE Trans. PAMI, Vol.13, No.5, pp.478-482, May
1991.

14. R. Hu and M.N. Fahmy, “Texture Segmentation based on a Hierarchical Markov
Random Field Model”, Signal Processing, Vol.26, pp.285-305, 1992.

15. F.S. Cohen and Z. Fan, “Maximum Likelihood Unsupervised Texture Image
Segmentation”, CVGIP: Graphical Models and Image Processing, Vol.54, No.3,
pp.239-251, 1992.

16. P. Andrey and P. Tarroux, “Unsupervised Segmentation of Markov Random Field
Modeled Textured Images Using Selectionist Relaxation”, IEEE Trans. PAMI, Vol.20,
No.3, 1998.

17. S.A. Barker and P.J.W. Rayner, “Unsupervised Image Segmentation Using Markov
Random Field Models”, Proc. EMMCVPR’99 (Lecture Notes in CS printed by
Springer), pp.179-194, May 1997.

18. P. Rostaing, J.N. Provost and C. Collet, “Unsupervised Multispectral Image
Segmentation Using Generalized Gaussian Noise Model”, Proc. EMMCVPR’99 (Lecture
Notes in CS printed by Springer), pp.142-156, July 1999.

19. Rama Chellappa and Anil Jain, “Markov Random Fields: Theory and
Application”, Academic Press, 1993.

20. S. Kamijo, Y. Matsushita, K. Ikeuchi, M. Sakauchi, “Traffic Monitoring and
Accident Detection at Intersections”, IEEE Trans. ITS, Vol.1, No.2, June 2000,
pp.108-118.

21. S. Kamijo, Y. Matsushita, K. Ikeuchi, M. Sakauchi, “Occlusion Robust Tracking
utilizing Spatio-Temporal Markov Random Field Model”, International Conference
on Pattern Recognition (ICPR), Barcelona, Sep. 2000, Vol.1, pp.142-147.




Highlight and Shading Invariant Color Image
Segmentation Using Simulated Annealing



Paul Fieguth and Slawo Wesolkowski 



Systems Design Engineering 
University of Waterloo 
Waterloo, Ontario, Canada, N2L-3G1 
{swesolko,pfieguth}@uwaterloo.ca
http://ocho.uwaterloo.ca/~pfieguth/



Abstract. Color constancy in color image segmentation is an impor- 
tant research issue. In this paper we develop a framework, based on 
the Dichromatic Reflection Model for asserting the color highlight and 
shading invariance, and based on a Markov Random Field approach for 
segmentation. A given RGB image is transformed into a R’G’B’ space to 
remove any highlight components, and only the vector-angle component, 
representing color hue but not intensity, is preserved to remove shading 
effects. Due to the arbitrariness of vector angles for low R’G’B’ values, we 
perform a Monte-Carlo sensitivity analysis to determine pixel-dependent 
weights for the MRF segmentation. Results are presented and analyzed. 



1 Introduction 

In recent years the problem of color constancy - the perception of objects in the 
real world without illumination effects - has been a major research subject in 
the image science and technology communities. In spite of shading and highlight 
effects, humans are quite able to perceive object surfaces in a scene, a difficult
task for computer systems. An algorithm for color image segmentation, which 
is invariant to shading and highlight effects, has recently been introduced [23], 
developed in the context of the Dichromatic Reflection Model of Shafer [12]. 

In [23] the authors describe a principal component analysis and vector angle
clustering-based approach for color image segmentation. In this method, the
prototype vector is described as the principal vector (as opposed to principal
curve) of the RGB color cluster, and the distance from this “cluster center” to
a pixel in the image is calculated using the vector angle. The number of
clusters is selected and the algorithm chooses the optimal (in the Mean Squared
Error sense) multi-vector fit to the data [3]. The illumination invariances are
well captured by this method; however, there are several drawbacks:

1. For small (black) RGB values the algorithm breaks down and produces ex- 
tremely noisy angles. 

2. All colors must fit into a predetermined number of clusters. 

3. Border areas composed of composite colors are classified arbitrarily. 
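
To make the vector-angle measure and the first drawback concrete, the following sketch computes the angle between two RGB vectors. The function name `vector_angle` and the choice to return NaN for a zero vector are assumptions made for this illustration, not part of the method in [23].

```python
import math


def vector_angle(p, q):
    """Angle in radians between two RGB vectors p and q."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    if norm == 0.0:
        return float("nan")  # the angle is undefined for a black pixel
    # Clamp to [-1, 1] to guard against rounding before acos.
    return math.acos(max(-1.0, min(1.0, dot / norm)))
```

Scaling a pixel by a shading factor leaves the angle unchanged, whereas a one-count perturbation of a nearly black pixel such as (1, 0, 0) to (0, 1, 0) swings the angle by a full right angle, which is the noise sensitivity noted in the first item.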



M.A.T. Figueiredo, J. Zerubia, A.K. Jain (Eds.): EMMCVPR 2001, LNCS 2134, pp. 314-327, 2001. 
(c) Springer-Verlag Berlin Heidelberg 2001

Certainly a wide variety of color-segmentation approaches have been proposed.
In particular, methods based on color clustering have seen considerable interest,
including k-means [15,21], fuzzy k-means [7], and morphology-based clustering
[9]. The most notable drawback of such clustering methods is that they normally
do not take any spatial relationships into account, and determine the
segmentation strictly on a pixel-by-pixel basis, normally using the Euclidean
distance. We will demonstrate that, for the problems of interest to us,
specifically the segmentation of images involving illumination effects, some
degree of spatial dependence is crucial in formulating an adequate approach. The
ability of Markov/Gibbs methods to model spatial dependencies makes them a very
natural fit to our context.

Ad-hoc local methods have also been proposed for color image segmentation 
such as [5,18,20]. In [5], the authors present a method based on the calcula- 
tion of principal components of local non-overlapping regions to estimate the 
region color. The method is also said to be highlight and shading invariant. [18] 
describes a method based on a region growing technique using the Euclidean 
distance as a similarity measure which is tested on images of homogenous color. 
[20] presents another region growing technique in which each region is defined 
by two values: the color gradient (calculated using the Euclidean distance) be- 
tween two adjacent pixels and the maximum distance between two colors within 
this region. The first algorithm suffers from having to quantize the region seg- 
mentation information while the last two use the Euclidean distance. All three 
methods are based on various heuristics. 

The focus of the present paper is to formulate a color image processing
and segmentation technique in the context of the Dichromatic Reflection Model
[12,19], which is introduced in Section 2. The crucial question is how to measure
the similarity of two colors. Most previous methods assess the relationship
between two multispectral (including color) pixels based on the Euclidean
distance [7,9,21,22]. The Euclidean distance is often chosen for its simplicity
and mathematical tractability, and it is well-suited to feature spaces having an
isotropic distribution (for color, a good example is the CIE Luv space [11]).
However, in the case of color images, where each pixel is represented as an RGB
vector, the Euclidean distance is a particularly poor measure of color
similarity because the RGB space is anisotropic, especially when lighting
effects such as specular reflection and shading are present in the image.

In this paper, we propose to use the Dichromatic Reflection Model to trans- 
form the RGB image into a different space in which shading and specular reflec- 
tion are normalized. In this context, highlight and shading invariant color image 
segmentation means the finding of regions, homogenous in color, irrespective of 
illumination effects. 

Therefore, given the Dichromatic Reflection Model, why can the transformed
pixels not be clustered effectively using k-means [10] or other related techniques?
The problem is that sufficiently dark shades of any color all look alike (i.e., black),
and similarly specular reflections or highlights converge to the same color (the
color of the illuminating light, normally white). For example, Figure 1 clearly
illustrates highlights (glossy white image patches) and shading (intensity
variations on the surface of each fruit), taken under white light illumination.
Consequently, some sort of spatial model is essential in order to perform
segmentation, to assign a highlight pixel to a colored group based on its
surrounding context.

Fig. 1. Original RGB color scene image, showing highlights and shading, captured
using white light.

We propose to define the spatial context using a Gibbs/Markov approach, as 
outlined in Section 3. Certainly others have used Markov random fields for image 
segmentation [1,6,24]; however, normally these methods involve Gauss-Markov
random fields, where the GMRF defines a spatial texture for the R, G, B
components, from which segmentation can proceed as a separate hypothesis-testing
procedure applied to the GMRF likelihood [8]. Our approach is quite different:
we wish to find the segmented image directly as the result of energy minimiza- 
tion of some appropriately-defined Gibbs random field. Furthermore the regions 
are not distinguished on the basis of texture, rather on shading and highlight 
invariant color. That said, textured surfaces where the pixel variations are due 
to local shading effects (such as the surface of an orange) will be segmented cor- 
rectly, since the normalized color is similar for all such pixels; whereas textures 
with intrinsically different colors (such as marble or paisley) are not the focus of 
our approach. 

The formulation of our Gibbs model will be similar to others used for seg- 
mentation [4,6] except for a number of variations due to the peculiarities of our 
transformed space. We demonstrate the advantages of constructing an energy 
function for Markov Random Field-driven image segmentation using a measure 
related to the inner vector product. 

This paper first describes the Dichromatic Reflection model and a develop- 
ment of an optimization criterion for segmentation. Next, results on an artificial 
image and a real scene image are presented and analyzed. Finally, conclusions 
and directions for future work are given. 

2 Color Theory 

The Dichromatic Reflection Model [12,19] will be used in this paper to show
highlight and shading invariance properties of the new algorithm. First, the
DRM will be introduced. Next, the highlight invariance property will be briefly
explained. Finally, how shading invariance is achieved will be described.

2.1 Preliminaries 

The Dichromatic Reflection Model purports to separate light reflected from ob- 
jects into two different types: 

1. specular reflection or highlight characterized visually by a glossy appearance 
and describing light that is reflected in a mirror-like fashion from a surface; 

2. diffuse or body reflection which is the light reflected from a surface in all 
directions, giving a surface its usual colored appearance. 

This model has been described for a variety of materials [16]; the focus here 
will be on inhomogeneous dielectric materials such as plastics. The presentation 
of the DRM follows closely that given in [23]. First, light reflected from an
object surface o (called the color signal) is described as a function
$C^o(\lambda, x)$ of wavelength $\lambda$ and pixel location x:

$$C^o(\lambda, x) = \text{Body Reflection} + \text{Interface Reflection} \qquad (1)$$

$$= \alpha(x) S^o(\lambda) E(\lambda) + \beta(x) E(\lambda) \qquad (2)$$

where $E(\lambda)$ is the spectral power distribution of a light source,
$S^o(\lambda)$ is the spectral-surface reflectance of an object o, $\alpha(x)$ is
the shading factor and $\beta(x)$ is a scalar factor for the specular reflection
term. The following set of equations can then represent the sensor responses for
a camera using R, G, and B coordinates:

$$\begin{bmatrix} R \\ G \\ B \end{bmatrix} = \int C^o(\lambda, x) \begin{bmatrix} R_R(\lambda) \\ R_G(\lambda) \\ R_B(\lambda) \end{bmatrix} d\lambda \qquad (3)$$

where $R_i(\lambda)$ (i = R, G, B) are the spectral sensitivity functions of the
camera in the visible spectrum. Substituting (2) into (3), we have

$$\begin{bmatrix} R \\ G \\ B \end{bmatrix} = \alpha(x) \int S^o(\lambda, x) E(\lambda) \begin{bmatrix} R_R(\lambda) \\ R_G(\lambda) \\ R_B(\lambda) \end{bmatrix} d\lambda + \beta(x) \int E(\lambda) \begin{bmatrix} R_R(\lambda) \\ R_G(\lambda) \\ R_B(\lambda) \end{bmatrix} d\lambda \qquad (4)$$

$$= \alpha(x) c_b + \beta(x) c_i \qquad (5)$$

where $c_b$ is the body color vector and $c_i$ is the illumination color vector.
These color vectors are normalized to unit length.

For the sensor outputs R, G, and B to be white balanced, it is necessary to 
satisfy the following condition: 



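$$\int E(\lambda) R_R(\lambda)\, d\lambda = \int E(\lambda) R_G(\lambda)\, d\lambda \qquad (6)$$

$$= \int E(\lambda) R_B(\lambda)\, d\lambda \qquad (7)$$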

As long as the illuminant $E(\lambda)$ is a constant white over the visible
wavelengths, and the spectral sensitivity functions $R_i(\lambda)$
(i = R, G, B) have the same area, then the above condition obviously holds.
However, if the illuminant is not white, a color balancing step [23] is needed
where the three sensor outputs are adjusted to be equal to each other. In this
paper it will be assumed that the illumination light is white or the image has
been white balanced.



2.2 Highlight Invariance 

To remove the effects of highlights it is necessary to transform the pixel coordi- 
nates according to the following transformation [14,23]: 



$$\begin{bmatrix} R' \\ G' \\ B' \end{bmatrix} = \begin{bmatrix} R \\ G \\ B \end{bmatrix} - \mathrm{AVG} \qquad (8)$$



where AVG represents the average value of R, G and B. In this transformation, 
the reflectance variation caused by interface reflection is removed by projecting 
the observed reflectance in an n-dimensional vector space along the illumination
vector onto an (n-1)-dimensional subspace that is perpendicular to the illumination
vector [14]. From a practical point of view, a histogram of the RGB pixels
making up a homogeneously colored region containing a highlight patch would
show two connected clusters (one for the homogeneous color and one for the
highlight).

For example, Figure 2 shows such a distribution in the RGB space of pixels
from Figure 1. The four clusters appear highly spread out and are nonlinear (do
not lie along a straight line in RGB space), because each cluster is composed of
both body and specular reflections.

The transformation (8) transforms each set of nonlinear clusters into a single 
linear cluster representing the body reflection. This is well illustrated in Figure 3, 
where the original nonlinear clusters now appear as linear groupings. Given that 
the RGB components are assumed to be white balanced, the application of (5)- 
(8) eliminates the interface reflection term and reduces to 



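$$\begin{bmatrix} R' \\ G' \\ B' \end{bmatrix} = \alpha(x) \int S^o(\lambda, x) E(\lambda)\, \frac{1}{3} \begin{bmatrix} 2R_R(\lambda) - R_G(\lambda) - R_B(\lambda) \\ -R_R(\lambda) + 2R_G(\lambda) - R_B(\lambda) \\ -R_R(\lambda) - R_G(\lambda) + 2R_B(\lambda) \end{bmatrix} d\lambda \tag{9}$$

$$= \alpha(x) \int S^o(\lambda, x) E(\lambda) \begin{bmatrix} R'_R(\lambda) \\ R'_G(\lambda) \\ R'_B(\lambda) \end{bmatrix} d\lambda \tag{10}$$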



This formulation is dependent on the shading factor (illumination) and the 
body reflection (material color) , which makes this color representation highlight 
invariant. Individual elements of the pixel vector in the new representation will 
be shifted according to the average of the body reflection term. This results in 
the new space having negative coordinates. Equivalently, the spectral sensitivity
functions $R'_R(\lambda)$, $R'_G(\lambda)$, and $R'_B(\lambda)$ in the new system also have negative
values. Three properties were derived from this representation. The first property
says that all RGB colors fall into one of six quadrants. The second one says that
all gray values (including saturated highlight areas) naturally collapse to the
(0,0,0) point. Finally, the third property demonstrates that the same color can
only exist in quadrants that have at least one adjacent edge.

Fig. 2. Distribution of pixels in the RGB space from Figure 1. The straight lines are
the principal vectors obtained with the best MSE fit from [23]. Both the red and
orange fruits have been clumped into one larger cluster. Whereas three of the four
clusters depicted in the image correspond to fruit colors, the fourth represents all of
the highlight areas.
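
As a rough numerical illustration of transformation (8) and of the gray-value property above, the following Python sketch subtracts the channel average from a pixel. The function name and the sample pixel values are hypothetical, and the specular term is assumed to contribute equally to R, G, and B (that is, the input is white balanced).

```python
def highlight_invariant(rgb):
    """Apply transformation (8): subtract the channel average from R, G, B."""
    r, g, b = rgb
    avg = (r + g + b) / 3.0
    return (r - avg, g - avg, b - avg)


# Hypothetical white-balanced pixel: a body color plus a white specular term.
body = (0.6, 0.3, 0.1)
specular = 0.25  # assumed equal contribution to R, G and B under white light
pixel = tuple(c + specular for c in body)

# The specular offset cancels, so both pixels map to the same R'G'B' point,
# and some of the transformed coordinates are negative.
print(highlight_invariant(body))
print(highlight_invariant(pixel))

# Any gray value collapses to the origin.
print(highlight_invariant((0.5, 0.5, 0.5)))
```

The transformed coordinates always sum to zero, which is why the new representation lies in a two-dimensional subspace perpendicular to the illumination vector.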



2.3 Shading Invariance 

Ensuring the shading invariance of the algorithm means that the shading
factor first shown in (2) needs to be eliminated from the representation obtained
using (10). The simplest way to do this is to normalize the new color vectors to
unit length [14]. First, reformulate (10) as





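$$c' = \begin{bmatrix} R' \\ G' \\ B' \end{bmatrix} = \alpha(x) \int S^o(\lambda, x) E(\lambda) \begin{bmatrix} R'_R(\lambda) \\ R'_G(\lambda) \\ R'_B(\lambda) \end{bmatrix} d\lambda = \alpha(x) \begin{bmatrix} c^o_{R'} \\ c^o_{G'} \\ c^o_{B'} \end{bmatrix} \tag{11}$$

where $c^o_{R'}$, $c^o_{G'}$, $c^o_{B'}$ represent the non-factorable terms of (8). Now normalizing the
color vector, $c'$, we obtain: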



$$\frac{c'}{|c'|} = \frac{\alpha(x)}{\left(\alpha^2(x)\left[(c^o_{R'})^2 + (c^o_{G'})^2 + (c^o_{B'})^2\right]\right)^{1/2}} \begin{bmatrix} c^o_{R'} \\ c^o_{G'} \\ c^o_{B'} \end{bmatrix} \tag{12}$$

$$= \frac{1}{\left((c^o_{R'})^2 + (c^o_{G'})^2 + (c^o_{B'})^2\right)^{1/2}} \begin{bmatrix} c^o_{R'} \\ c^o_{G'} \\ c^o_{B'} \end{bmatrix} \tag{13}$$

Fig. 3. Distribution of pixels in the R'G'B' space from Figure 1; compare with the
RGB distribution in Figure 2. The straight lines are the principal vectors obtained
with the best MSE fit from [23]. The alignment of the four cluster prototypes with the
four color clusters is clearly seen; each of the four cluster prototypes corresponds to a
colored fruit.
The shading factor has been eliminated and hence this representation is intensity
invariant. This operation puts all vectors on the unit hypersphere, except for
the null vector (0,0,0) for which this operation is undefined. Since the Euclidean
distance between two normalized transformed color vectors does not accurately
reflect the perceptual difference between the two vectors, we propose to factor
the invariance operation directly into the similarity measure calculation by using
one minus the cosine of the vector angle $\theta_{c',d'}$ between two transformed color
vectors $c'$, $d'$; the similarity measure then becomes



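$$\Omega(\theta_{c',d'}) = 1 - \frac{\langle c', d' \rangle}{|c'|\,|d'|} \tag{14}$$

$$= 1 - \frac{\alpha_c(x)\,\alpha_d(x)\left(c^o_{R'} d^o_{R'} + c^o_{G'} d^o_{G'} + c^o_{B'} d^o_{B'}\right)}{\left(\alpha_c^2(x)\left(c^o_{R'} c^o_{R'} + c^o_{G'} c^o_{G'} + c^o_{B'} c^o_{B'}\right)\right)^{1/2} \left(\alpha_d^2(x)\left(d^o_{R'} d^o_{R'} + d^o_{G'} d^o_{G'} + d^o_{B'} d^o_{B'}\right)\right)^{1/2}} \tag{15}$$

$$= 1 - \frac{c^o_{R'} d^o_{R'} + c^o_{G'} d^o_{G'} + c^o_{B'} d^o_{B'}}{\left(c^o_{R'} c^o_{R'} + c^o_{G'} c^o_{G'} + c^o_{B'} c^o_{B'}\right)^{1/2} \left(d^o_{R'} d^o_{R'} + d^o_{G'} d^o_{G'} + d^o_{B'} d^o_{B'}\right)^{1/2}} \tag{16}$$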



So if $c'$ and $d'$ are similar in orientation then (16) will be close to zero. Both
vectors will be deemed close irrespective of the shading factors $\alpha_c(x)$ and $\alpha_d(x)$
associated with them. Therefore, this method is also shading invariant. In practice,
the color vectors R', G', B' are normalized as in (13), which reduces (16) to
a simple dot product calculation for each pixel comparison.
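
The shading invariance of the similarity measure can be checked with a small Python sketch of (14). The function name and the sample vectors are hypothetical; the second vector is the first one scaled by a different (assumed) shading factor.

```python
import math


def vector_angle_dissimilarity(c, d):
    """Return 1 - cos(angle between c and d), as in (14)."""
    dot = sum(ci * di for ci, di in zip(c, d))
    norm_c = math.sqrt(sum(ci * ci for ci in c))
    norm_d = math.sqrt(sum(di * di for di in d))
    if norm_c == 0.0 or norm_d == 0.0:
        # The measure is undefined for the null vector (0, 0, 0).
        raise ValueError("undefined for the null vector")
    return 1.0 - dot / (norm_c * norm_d)


c = (0.3, -0.1, -0.2)          # hypothetical transformed color c'
d = tuple(0.4 * x for x in c)  # same body color under a darker shading factor

# The shading factors cancel, so the measure is (numerically) zero.
print(vector_angle_dissimilarity(c, d))
```

If both vectors are first normalized as in (13), the two square roots disappear and each comparison costs only a dot product.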



2.4 Angle Accuracy 

The principal problem with the vector angle formulation derives from the con- 
sequences of using (8), which collapses all graylevel values to the origin in the 
transformed domain, and (13), which performs a vector normalization. Stated 
more plainly, a collection of noisy, nearly black pixels will be normalized to vastly 
different transformed locations. This is strictly a reflection of the large degree of 
sensitivity in the definition of a “hue” for nearly black pixels. 

Standard clustering approaches either require such pixels to be rejected 
(needing an arbitrary rejection threshold) or incorporate them, leading to mis- 
leading conclusions. The elegance of the MRF approach to a segmentation al- 
gorithm, set up in the next section, is that the penalty term associated with a 
vector angle can be continuous, rather than discrete admit/reject. 

The noise sensitivity of the similarity measure (16) is easily computed, as 
a preprocessing step, using Monte-Carlo means. In particular, if we model two 
pixels as noisy 



$$c = c_{\text{exact}} + \text{noise} \tag{17}$$

$$d = d_{\text{exact}} + \text{noise} \tag{18}$$

then the variance $\mathrm{var}(\Omega(\theta_{c',d'}))$ can be efficiently computed; clearly if this variance
in angle difference is small then the accuracy of the angle calculation will
be deemed high and will be weighted more heavily in the Gibbs energy.
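
A minimal Monte-Carlo sketch of this variance estimate is given below. It is only an illustration under stated assumptions: the noise is taken to be independent zero-mean Gaussian on each RGB channel (the text does not fix the noise model), and the function names, pixel values, and noise level are hypothetical.

```python
import math
import random


def transform(rgb):
    """Transformation (8): subtract the channel average."""
    avg = sum(rgb) / 3.0
    return [v - avg for v in rgb]


def omega(c, d):
    """Similarity measure (14): one minus the cosine of the vector angle."""
    dot = sum(ci * di for ci, di in zip(c, d))
    norm = math.sqrt(sum(ci * ci for ci in c)) * math.sqrt(sum(di * di for di in d))
    return 1.0 - dot / norm


def omega_variance(c_exact, d_exact, sigma, trials=2000, seed=0):
    """Monte-Carlo estimate of var(Omega) under additive noise, as in (17)-(18).

    Assumption: independent zero-mean Gaussian noise with standard
    deviation sigma on every RGB channel.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        c = [v + rng.gauss(0.0, sigma) for v in c_exact]
        d = [v + rng.gauss(0.0, sigma) for v in d_exact]
        samples.append(omega(transform(c), transform(d)))
    mean = sum(samples) / len(samples)
    return sum((s - mean) ** 2 for s in samples) / (len(samples) - 1)


bright = (200.0, 50.0, 20.0)  # hypothetical saturated color
dark = (5.0, 4.0, 4.0)        # hypothetical nearly black pixel
var_bright = omega_variance(bright, bright, sigma=2.0)
var_dark = omega_variance(dark, dark, sigma=2.0)
print(var_bright, var_dark)
```

With these values the nearly black pair yields a far larger variance than the saturated pair, which is the behavior the weighting is meant to capture.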

3 Markov Random Fields 

The modeling problems in this paper are addressed from the computational 
viewpoint. There are two primary concerns: how to define an objective function 
for the optimal solution of the image segmentation problem, and how to find its 
optimal solution. Given the various uncertainties in the imaging process, it is 
reasonable to define the desired solution in an optimization sense, such that the 
“perfect” or “exact” solution to our segmentation problem is interpreted as the 
optimum solution to the optimization objective. 

Some forms of contextual constraints are eventually necessary when trying 
to interpret visual information. The spatial and visual contexts of the objects 
in an image scene are necessary for the understanding of the scene; the context 
of object features at a lower level of representation allows the recognition of
the objects; the context of primitives at an even lower level lets the object 
features be identified; and finally the context of image pixels at the lowest level of 
abstraction allows for the extraction of those primitives. To create a reliable and 
effective image analysis system the use of contextual constraints is unavoidable 
and therefore indispensable. 

Gibbs Random Fields (GRFs) [4,24] provide a natural way of modeling context
dependencies between, for example, image pixels or correlated local features
[6]. The practical use of GRF models is largely possible due to the improved insights
and understanding provided by the Hammersley-Clifford theorem [6], which allows
Markov random field (MRF) modeling to be reinterpreted as an energy
function minimization. The second motivating development is the improved insight
and available methods for Gibbs sampling and Simulated Annealing.

The MRF-based segmentation model is defined by the contextual relation- 
ships within the local neighborhood structure. Since our goal is the assertion 
of local constraints, rather than an accurate modeling of spatial textures, as in 
other GMRF color-segmentation research [8], we shall only be concerned with 
first order random fields, both simplifying the model and limiting the computa- 
tional complexity. 

Suppose we are given a color image on a pixel lattice $L = \{(i,j)\}$. As just
discussed in Section 2, each pixel $\{RGB\}_{i,j}$ is transformed to its normalized
representation $c'_{i,j}$.

If we precompute the adjacent-pixel vector-angle criteria

$$\Omega(\theta_{c'_{i,j},\,c'_{i+1,j}}), \quad \Omega(\theta_{c'_{i,j},\,c'_{i,j+1}}) \tag{19}$$



then a Gibbs energy E for segmentation can be formulated as follows: 



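$$E = \sum_{i,j} \Big\{ \alpha \big[ \Omega(\theta_{c'_{i,j},\,c'_{i+1,j}})\, \delta_{l(i,j),\,l(i+1,j)} + \Omega(\theta_{c'_{i,j},\,c'_{i,j+1}})\, \delta_{l(i,j),\,l(i,j+1)} \big] + \beta \big[ (1 - \delta_{l(i,j),\,l(i+1,j)}) + (1 - \delta_{l(i,j),\,l(i,j+1)}) \big] \Big\} \tag{20}$$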



where each pixel $(i,j)$ is assigned an integer label $0 \le l(i,j) < N$, and where
$\alpha$, $\beta$ control the relative constraints on the homogeneity of a single region and
the degree of region fragmentation, respectively.
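
The following Python sketch evaluates an energy of this kind for a given labeling. It assumes one plausible reading of (20): neighboring pixels with the same label contribute $\alpha$ times their vector-angle measure, and neighboring pixels with different labels contribute $\beta$. The function name, array layout, and toy values are hypothetical.

```python
def gibbs_energy(labels, omega_down, omega_right, alpha, beta):
    """Potts-style segmentation energy over a 2-D label grid.

    labels[i][j]      : integer label of pixel (i, j)
    omega_down[i][j]  : vector-angle measure between (i, j) and (i+1, j)
    omega_right[i][j] : vector-angle measure between (i, j) and (i, j+1)
    Assumed form: alpha * Omega for equal labels, beta for unequal labels.
    """
    rows, cols = len(labels), len(labels[0])
    energy = 0.0
    for i in range(rows):
        for j in range(cols):
            if i + 1 < rows:
                if labels[i][j] == labels[i + 1][j]:
                    energy += alpha * omega_down[i][j]
                else:
                    energy += beta
            if j + 1 < cols:
                if labels[i][j] == labels[i][j + 1]:
                    energy += alpha * omega_right[i][j]
                else:
                    energy += beta
    return energy


# Toy 2x2 image: the two rows carry different labels.
labels = [[0, 0], [1, 1]]
omega_down = [[0.9, 0.8], [0.0, 0.0]]   # large angles across the row boundary
omega_right = [[0.1, 0.0], [0.2, 0.0]]  # small angles within each row
print(gibbs_energy(labels, omega_down, omega_right, alpha=1.0, beta=0.5))
```

Merging the two rows into one label would replace the two $\beta$ penalties by the large cross-boundary angle terms, so the split labeling has the lower energy here.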

The model (20) is intuitive and easily implemented. As mentioned before,
it deviates from previous models used for image segmentation in that the inner
vector product $\Omega$ is used to calculate the minimum energy instead of the Euclidean
distance in R, G, B. However, it misses one essential point: not all of the
vector angles are computed with the same accuracy. Even a small amount of
pixel noise on a dark or highlight region results in nearly totally random vector
angles, which (20) would choose to separate into single-pixel regions. Given the
covariance of the vector angle difference, computed by analytic or Monte-Carlo
means as discussed in Section 2.4, we introduce weights



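$$\frac{1}{\mathrm{var}(\Omega(\theta_{c'_{i,j},\,c'_{i+1,j}}))}, \quad \frac{1}{\mathrm{var}(\Omega(\theta_{c'_{i,j},\,c'_{i,j+1}}))} \tag{21}$$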



to assert the degree of confidence of the terms in the energy: 



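$$E = \sum_{i,j} \Big\{ \alpha \Big[ \frac{\Omega(\theta_{c'_{i,j},\,c'_{i+1,j}})}{\mathrm{var}(\Omega(\theta_{c'_{i,j},\,c'_{i+1,j}}))}\, \delta_{l(i,j),\,l(i+1,j)} + \frac{\Omega(\theta_{c'_{i,j},\,c'_{i,j+1}})}{\mathrm{var}(\Omega(\theta_{c'_{i,j},\,c'_{i,j+1}}))}\, \delta_{l(i,j),\,l(i,j+1)} \Big] + \beta \big[ (1 - \delta_{l(i,j),\,l(i+1,j)}) + (1 - \delta_{l(i,j),\,l(i,j+1)}) \big] \Big\} \tag{22}$$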

Model (22) is a very credible segmentation criterion, representing a considerable
advance beyond standard vector-angle methods, and yet (22) is little more complicated
than a standard Ising/Potts model [24] and so is well understood and
easily implemented.

Fig. 4. Boundary length problem: both regions have the same boundary length, although
very different volumes.



The primary drawback with (22) is that it is strictly a local, pixel-neighbor 
model and suffers from the same problems as other region-growing approaches: 
two vastly differently colored pixels may be grouped into a single region if they 
are linked by noisy or intermediately-colored pixels. A second undesired effect is 
that N constrains only the number of region labels, not the number of regions; 
that is, in regions of noise or color-gradients, (22) can generate a proliferation 
of small regions. Finally, the label criterion, controlled by $\beta$, measures boundary
length, rather than region volume (see Figure 4). Therefore, in regions where 
the vector-angle criterion is vague (that is, in saturated or dark regions), a large 
number of pixels may have to be flipped to see any change in the energy, implying 
that only the slowest of annealing schedules will successfully converge. 

A global model can overcome these drawbacks. If we associate with label $l$ a
global transformed color $a'_l$ then each region is forced to be well defined:



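$$E[\{l(i,j), a'\}] = \sum_{i,j} \Big\{ \alpha\, \Omega(\theta_{c'_{i,j},\,a'_{l(i,j)}}) + \beta \big[ (1 - \delta_{l(i,j),\,l(i+1,j)}) + (1 - \delta_{l(i,j),\,l(i,j+1)}) \big] \Big\} \tag{23}$$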

For the purposes of this paper, we propose to fix the region colors $\{a'_l\}$; that is,
the sampling and annealing takes place only over the label indices $\{l(i,j)\}$ themselves.
The $\{a'_l\}$ would be found by a preceding step, such as vector quantization
[21].

A final modification mirrors that of (22): the degree to which the region color 
is to be asserted at each pixel should be spatially-varying, now for two reasons: 

1. The color-dependent effect of noise, particularly for dark and highlight pixels. 

2. We are normally not interested in pixels in regions of high color gradient; at 
the very least, these pixels should not unduly influence the Gibbs energy by 
being inconsistent with the region color a. 



If we let 






var(l7(0c 



.))’ varj^{Q{0^ 



,)) 



(24) 



that is, the variances are the pointwise one, based on a noise model, and a
spatial one, computed over a local neighborhood $\mathcal{N}$, then our segmentation model
becomes









P (1 ) 3^ (^ j)4(*it+i) ). (25)

This gives us a concise and coherent representation of the color image segmentation
problem by incorporating both local and global constraints. The global
constraints are defined by global color region labels obtained through some vector
quantization process such as the one presented in [23]. Local constraints are
included by virtue of using pixel level constraints in the MRF model.

Model (25) is a tradeoff between a completely local region growing approach, 
where many spurious regions can be created, and a global color clustering ap- 
proach where regions of differing color can be inadvertently merged. Further- 
more, the use of vector angle accuracy weights (21) allows the less reliable cal- 
culation of vector angle for small R’G’B’ values to be appropriately modulated. 

4 Results 

The Gibbs Sampler [4] will be used to optimize both (22) and (25). To make
comparison as straightforward as possible, all MRF results were initialized from
a random start, although in practice initializing from an MPC or other segmentation
could accelerate convergence. For the global model (25) the label colors
$a'_l$ are determined using the algorithm presented in [23].

Results were prepared for an artificial image of colored bands, shown in Figure 5.
The artificial image varies in intensity horizontally (i.e., from left to
right), and a saturated highlight is present near the right border. Additive
uniformly distributed noise was added to this image.

The MPC result on the artificial image is shown in Figure 5(b). The highlight
part is clearly a mixture of the three other segmentation classes due to
having a nearly null vector representation in the R'G'B' space, and the absence
of spatial constraints prevents the ambiguity from being corrected. For
the MRF models, the results in Figure 6(a) and Figure 6(b) clearly illustrate
the problems of boundary length discussed in Section 3, because of the lack of
region-defining constraints such as characteristic region vector, boundary length
or area size constraints. It is interesting to note that under careful examination,
regions generated on both sides of the border between each color band pair are
seldom part of the same class. Figure 6(c) demonstrates the type of result that is
obtained using Model (23). As desired, very few highlight parts remain as other
highlight areas have been subsumed into their adjacent regions. The remaining
few misclassified pixels are most probably due to a too-rapid annealing schedule.

The free parameter $\beta$ controls the significance of the color-angle dot
product in relation to the spatial label contribution in the energy term; clearly
in the limit of a small value of $\beta$, the MRF result converges to that of MPC.
Finally, Figure 6(d) shows the results for the same color bands, but now where
the vector angle calculation is weighted in terms of the accuracy to which the
angle can be determined (which is affected by darkness or degree of highlight),
as in (25).

5 Conclusions 

We have presented a Markov Random Field-based model for shading and highlight
invariant color image segmentation. The model's invariance properties have
been verified using the Dichromatic Reflection Model. Furthermore, the model
is based on a vector angle difference measure between color vectors and includes
weights to take into account the reliability of calculating angles between various
vector pairs.

Fig. 6. Color Band image: Results of four proposed MRF models. (a) Model (20),
(b) Model (22), (c) Model (23), (d) Model (25).

The MRF model used is a compromise between local-only or global-only color 
image segmentation methods. It combines the best of both worlds: the ability 
of the global methods to create well segmented regions and the ability of local 
methods to adapt to the local variations in pixel values. 

There are three immediate considerations for future work. First, it is not obvious
that it is desirable to fix the region colors in models (23), (25). The obvious
advantage of doing so is computational; however, the disadvantage
is that any error in the vector-quantization step is locked in place and
cannot be removed. Instead, the region colors $\{a'_l\}$ can be variable, determined
as part of the annealing process. Although this requires the Gibbs sampling of
continuous values, the effort can remain reasonable if the variables are accurately
initialized: $\{a'_l\}$ from vector quantization or k-means [10], and the pixel labels
from (22).

Furthermore, the limitation, as illustrated in Figure 4, of using the boundary 
length as an energy metric for each segmented region, should be revisited. The 
most obvious choice would be to prefer larger regions, where region size is mea- 
sured by the number of pixels in the region. Although much more robust than 
boundary length, the number of pixels is a non-local criterion, and is therefore 
computationally much less convenient. 

Finally, parameter estimation to obtain proper convergence of the MRF models
is essential. In this paper, parameter estimation was ad hoc. A formalized
parameter estimation technique needs to be applied to fully evaluate the advantages
of the MRF models over vector quantization and region growing-based
methods when applied to real scene images.

Abstract. The paper describes a robust edge and contour extraction
technique under two types of degradation: random noise and aliasing.
The technique employs unambiguous probabilistic relaxation to distinguish
features from noise and refine their spatial locations at sub-pixel
accuracy. The most important component in the probabilistic relaxation
is a compatibility function. The paper suggests a function with which the
optimal orientation of edges can be derived analytically, thus allowing an
efficient implementation of the relaxation process. A contour extraction
algorithm is designed by combining the relaxation process and a perceptual
organization technique. Results on both synthetic and natural
images are given and show the effectiveness of our approach against noise
and aliasing.

Keywords: feature extraction, relaxation labelling, segmentation 



1 Introduction 

Feature extraction is an essential part of most computer vision problems. Many 
features such as edges and corners are high frequency components and can be 
easily obscured by noise. Thus, effective feature extraction processes must incor- 
porate some degree of noise removal capability. Another major obstacle against 
reliable feature extraction is aliasing due to finite sampling of data. The aliasing 
obscures the spatial location of features. Researchers are continually working to 
overcome these problems. 

Many feature extraction algorithms proposed in the literature assume that
noise has been reduced using some standard smoothing technique such as the
Wiener filter, Gaussian smoothing, or non-linear diffusion [1,9,20]. A problem
with handling noise separately from the feature extraction process is that it is
difficult to determine the amount of smoothing required to remove noise
without removing actual features. Even with an optimal amount of smoothing,
some subtle features would be lost. Another disadvantage associated with
smoothing is that it further obscures the spatial location of features. The tradeoff
between signal-to-noise ratio (SNR) and localization accuracy is well known,
and linear-filter-based techniques such as Gaussian smoothing have a theoretical
limit on their performance in terms of SNR and localization.

A more reliable approach is to distinguish features from noise by a localized 
pattern analysis. The underlying assumption is that features form non-random 
patterns while noise does not. Also, prior knowledge of feature patterns can
improve the spatial localization of the features. Such pattern analysis is believed
to be a part of human vision processing, as evidenced by vernier hyperacuity
and contour "pop-out" [4].

Our research goal is to derive a reliable feature extraction and localization
system based on simple localized pattern analysis. Such a system is of not only
practical but also theoretical interest, as it can bring some insight into the
segmentation mechanism of the human vision system. In this paper, we employ
probabilistic relaxation or relaxation labeling [22,8] to filter out random noise
and extract high frequency features and their spatial locations from low-resolution
images. The technique searches for a near-optimal edge configuration in
terms of its location and orientation at sub-pixel resolution (i.e., we wish to
resolve high-resolution edge contours).

The paper suggests a general approach for designing an edge-based compatibility
function that is a core ingredient of the relaxation process. It then
provides a particular realization of the function that allows a computationally
efficient procedure for performing the relaxation and achieving a near-optimal
edge configuration. We then develop a contour extraction technique that combines
the result of the relaxation with a perceptual organization technique. The
computation is purely local and intrinsically parallel.

The paper is organized as follows. Section 2 provides a brief description of
probabilistic relaxation followed by a detailed description of how we designed a
compatibility function for recovering edge configuration. Section 3 describes how
to perform the relaxation process in a computationally efficient manner. Section
4 provides a contour extraction procedure based on the relaxation and perceptual
grouping. Section 5 gives some experimental results of the edge localization and
contour extraction processes using both synthetic and natural images. Section 6
provides a brief discussion of other related work and some relevant neurological
evidence. Finally, Section 7 concludes with a summary.

2 Probabilistic Relaxation 

The technique explores global consistency through local iterative interactions or
"relaxation". It measures local consistency of an object to its neighbor objects
based on a collective sum of a simple pair-wise compatibility measure. The measure
captures the prior knowledge of the structural or contextual patterns of
interest. At each iteration, the configuration of each object is updated so that it
is more consistent with its neighbors. The configuration of an object is represented
by the probability distribution of its labels or "states". Through iterative local
interaction, the process approaches a more globally consistent configuration.
The technique has been studied extensively in both theory and implementation
[8,13,19,21], and has found many applications in image processing and computer
vision [6,10].

For any application of probabilistic relaxation, a compatibility function is
necessary. It is often defined as $r_{ij}(\lambda_i, \lambda_j)$ for two objects at $i$ and $j$ having labels
$\lambda_i$ and $\lambda_j$, respectively. The function quantifies how likely or how compatible
the label $\lambda_i$ of an object at $i$ is to $\lambda_j$ of another object at $j$. We can also
associate it with the conditional probability $\Pr(\lambda_i|\lambda_j)$. However, the specification
of $r_{ij}(\lambda_i, \lambda_j)$ is less constrained than $\Pr(\lambda_i|\lambda_j)$ as the former is allowed to have
negative values and does not have to sum to 1. When $\lambda_i$ at site $i$ is compatible
(incompatible) with $\lambda_j$ at site $j$, the compatibility function should have a large
(small or negative) value.

Now a support function is defined based on the compatibility function as

$$S_i(\lambda_i) = \sum_j \sum_{\lambda_j} r_{ij}(\lambda_i, \lambda_j) P_j(\lambda_j) \tag{1}$$

where $P_j(\lambda_j)$ is the probability of having label $\lambda_j$ at $j$. In other words, $P_j(\lambda_j)$ is
the Probability Density Function (PDF) of $\lambda_j$. The support function measures
how likely the site $i$ is labelled as $\lambda_i$, given the configuration of its neighbors. At
last, the total support function is defined as

$$S = \sum_i \sum_{\lambda_i} S_i(\lambda_i) P_i(\lambda_i) = \sum_i \sum_{\lambda_i} \sum_j \sum_{\lambda_j} r_{ij}(\lambda_i, \lambda_j) P_i(\lambda_i) P_j(\lambda_j) \tag{2}$$

The total support function measures the global consistency of the particular
configuration. The objective of the probabilistic relaxation is to maximize the
total support by iteratively updating $P_i$ $\forall i$.

In its general form, probabilistic relaxation is not computationally amenable.
Difficulties associated with the technique are the following:

1. It is difficult to formulate the compatibility function as the function is
4-dimensional ($i$, $j$, $\lambda_i$ and $\lambda_j$) in general.

2. It is not simple to update $P_i$ as it has to be projected onto
$\{P_i(\lambda_k), k = 1..K \mid P_i(\lambda_k) \in [0, 1], \sum_k P_i(\lambda_k) = 1\}$ where $K$ is the number of
possible labels, and evaluation of the support function is often computationally
expensive [16,18].

The second difficulty listed above can be alleviated by using unambiguous
relaxation [8]. With unambiguous labeling, the only label allowed for an object
at $i$ to take is the one that maximizes $S_i$. By denoting the index of the label
that maximizes $S_i$ as $M(i)$ (i.e. $M(i) = \arg\max_k S_i(\lambda_k)$), the PDF becomes
$P_i(\lambda_k) = 1$ if $k = M(i)$ and $P_i(\lambda_k) = 0$ if $k \ne M(i)$. Then the support function
becomes

$$S_i(\lambda_{M(i)}) = \sum_j r_{ij}(\lambda_{M(i)}, \lambda_{M(j)}), \tag{3}$$

$$S_i(\lambda_k) = 0, \quad k \ne M(i). \tag{4}$$

The total support function is simplified to

$$S = \sum_i \sum_j r_{ij}(\lambda_{M(i)}, \lambda_{M(j)}). \tag{5}$$

Thus, the unambiguous relaxation alleviates the second difficulty listed above at 
the expense of flexibility in specifying the PDF. However, we still face a problem 
of designing the compatibility function. In the next sections, we concentrate on 
designing the compatibility function for application to edge extraction. 
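
To make the unambiguous formulation concrete, the sketch below evaluates the support (3) and the total support (5) for a hard labeling and improves it by a greedy sequential update. This is only an illustrative scheme, not necessarily the update schedule used in the paper; the function names, the chain-shaped neighborhood, and the toy compatibility function are hypothetical.

```python
def support(i, label, assignment, neighbors, r):
    """Support (3) for giving `label` to site i under the current assignment."""
    return sum(r(i, j, label, assignment[j]) for j in neighbors[i])


def total_support(assignment, neighbors, r):
    """Total support (5) of an unambiguous labeling."""
    return sum(support(i, assignment[i], assignment, neighbors, r)
               for i in range(len(assignment)))


def relax(assignment, labels, neighbors, r, max_sweeps=20):
    """Greedy sequential update: each site takes its best-supported label."""
    assignment = list(assignment)
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(assignment)):
            current = support(i, assignment[i], assignment, neighbors, r)
            best = max(labels,
                       key=lambda lab: support(i, lab, assignment, neighbors, r))
            # Switch only on a strict improvement, so ties keep the old label.
            if support(i, best, assignment, neighbors, r) > current:
                assignment[i] = best
                changed = True
        if not changed:
            break
    return assignment


# Toy problem: four sites on a chain, two labels, neighbors prefer equal labels.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
agree = lambda i, j, a, b: 1.0 if a == b else -1.0
start = [0, 0, 1, 0]
result = relax(start, [0, 1], neighbors, agree)
print(result)
print(total_support(start, neighbors, agree), total_support(result, neighbors, agree))
```

Because the toy compatibility function is symmetric, every accepted label change increases the total support, so the loop terminates at a locally consistent labeling.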



2.1 Compatibility Function 

As described above, the compatibility function captures the prior knowledge of
the patterns of interest; thus it is heavily dependent on the problem one wants to
solve. Here, our interest is to extract edges and their attributes. This section first
describes a general approach to designing a compatibility function for edges. The
approach is based on two assumptions: invariance to a Euclidean transform of
the coordinate system and invariance to the global illumination level. Then the
section describes a particular realization of the approach for extraction of edges
under noisy conditions. This realization results in a computationally efficient
procedure for maximizing the support function.

First, we consider that an edge is described by three attributes: the location
$(x, y)$, orientation $\theta$, and strength $m$. It will be represented by a vector $e(x, y)$
whose orientation and strength are $\angle e = \theta$ and $|e| = m$, respectively. Our
framework allows the location and orientation to be treated as a label for the
relaxation.

We define a compatibility function for edges as

$$r_{ij}(e_i, e_j) = f(i_x, i_y, j_x, j_y, \angle e_i, \angle e_j, |e_j|) \tag{6}$$

where $i_x$ and $i_y$ are the x and y coordinates of $e_i$, respectively, likewise for $j_x$ and
$j_y$ with respect to $e_j$. This is a function of 7 variables. Note that the function is
not dependent on $|e_i|$ as we do not treat the edge strength as a label.

To simplify the design process, we assume that the compatibility function is
invariant to both translation and rotation of the coordinate system. Then, we
design a prototype function with $j_x = j_y = 0$ and $\angle e_j = 0$. Later, this prototype
function is translated to $(j_x, j_y)$ and rotated by $\angle e_j$ to obtain the general form
of $r(e_i, e_j)$. Then the prototype function can be expressed as

$$r_{ij}(e_i, e_j) = f(i_x, i_y, \angle e_i, |e_j|). \tag{7}$$

For convenience, we use polar coordinates in describing $(i_x, i_y)$. Thus

$$r_{ij}(e_i, e_j) = f(d_i, \alpha_i, \angle e_i, |e_j|), \tag{8}$$

where $d_i = \sqrt{i_x^2 + i_y^2}$ and $\alpha_i = \arctan(i_y, i_x)$.

Fig. 1. Geometrical notations (a) used in describing the compatibility function,
(b) used in describing the support function maximization procedure.



In most image capturing environments, the image irradiance is proportional 
to the scene radiance. Therefore the edge strength is also proportional to the 
radiance [25]. Our goal is to obtain a consistent edge configuration based on 
the structure of objects, as much as possible, without being influenced by the 
level of illumination. Thus, we impose a constraint that the order of support 
should be invariant to the constant scaling of the scene illumination. To be more 
precise mathematically, the ratio of total supports for two different configurations 
remains constant with respect to the change in the global illumination level. 

It can be shown that the above constraint can be satisfied if the compatibility
function is proportional to the edge strength. Thus, the prototype compatibility
function is

$$r_{ij}(e_i, e_j) = |e_j|\, f(d_i, \alpha_i, \angle e_i). \tag{9}$$

Note that this is not the only choice for achieving the invariance to the global 
illumination level. Obviously, making the compatibility function totally indepen- 
dent of the edge strength is another option. However, we found that the scaling 
of the compatibility measure by the edge strength is important as strong surface 
discontinuities can strongly influence the neighboring configuration. 

The next design step is to decompose $f$ into a product of two terms: $r_{loc}$,
which measures the compatibility of the edge location to $e_j$, and $r_{or}$, which
measures the compatibility of $\angle e_i$ to $e_j$ given $d_i$ and $\alpha_i$. The functions $r_{loc}$ and
$r_{or}$ will be called compatibility factors. With this decomposition, $f$ is written as

$$f(d_i, \alpha_i, \angle e_i) = r_{loc}(d_i, \alpha_i \mid e_j)\, r_{or}(\angle e_i \mid d_i, \alpha_i, e_j). \tag{10}$$

Note that we used the | notation of conditional probability to make the meaning of
each factor clearer. One can draw an analogy between the decomposition and the
product rule of probability. This decomposition can be applied to any function
$f$ and does not impose any new constraints on $f$. However, the design process
becomes more tractable by breaking the compatibility relationship into two factors.

Our formulation of the compatibility factors is described next. The design is
heavily based on our intuition and other alternatives are possible.

Compatibility Factor $r_{loc}$. We arrive at our definition of $r_{loc}$ empirically.
Collinearity, one of the Gestalt rules, suggests that $\alpha_i$ is most compatible with $e_j$ when
it is equal to $\angle e_j$ or $\angle e_j + \pi$ [11]. The degree of compatibility decreases
as $\alpha_i$ deviates from these values. We use $\sin^2(\alpha_i)$ to quantify the deviation. We
also assume that the compatibility decreases as the distance between two sites
increases. Using the ideas above as guidelines, we suggest the following function
for $r_{loc}$ with distributions that are Gaussian in $d_i$ and exponential in $\sin^2(\alpha_i)$.

$$r_{loc}(d_i, \alpha_i \mid e_j) = e^{-\beta \sin^2(\alpha_i) - d_i^2/\sigma^2} \tag{11}$$

where $\sigma$ and $\beta$ are parameters for the Gaussian and exponential distributions,
respectively.



Compatibility Factor $r_{or}$. For $r_{or}$, we use $\cos(\angle e_i - \phi_i(d_i, \alpha_i, e_j))$, where
$\phi_i$ specifies the most compatible $\angle e_i$ to $e_j$. The function returns 1 when
$\angle e_i - \phi_i = 0$, and $-1$ when $\angle e_i - \phi_i = \pm\pi$. It is monotonic between the two
extrema.

We found empirically that $\phi_i$ can be specified further. We collected natural
images of various types, measured the correlation of gradient angles at different
offsets, and obtained PDFs of $\angle\nabla I(x, y) - \angle\nabla I(x + o_x, y + o_y)$ for a
distribution of offsets $(o_x, o_y)$. Note that $I$ represents image data and $\nabla$ is the
gradient operator. We found that the PDFs are strongly peaked at 0. Figure 2
shows PDFs of $\angle\nabla I(x, y) - \angle\nabla I(x + o_x, y + o_y)$ at two different offsets:
$(o_x, o_y) = (2, 0)$ and $(2, 2)$. The results suggest that the most compatible $\angle e_i$
to $e_j$ is $\angle e_j$. Thus, we set $\phi_i = \angle e_j$.
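The measurement described above can be sketched in Python as follows. This is a minimal sketch, not the authors' code: the function name `angle_difference_pdf`, the use of `numpy.gradient` as the gradient operator, the histogram binning, and the toy ramp image are all assumptions made for illustration, and offsets are assumed to be non-negative.

```python
import numpy as np

def angle_difference_pdf(image, offset, bins=9):
    """Histogram estimate of the PDF of the gradient-angle difference
    angle(grad I(x, y)) - angle(grad I(x + ox, y + oy)).
    Assumes a 2-D array and a non-negative offset (ox, oy)."""
    ox, oy = offset
    gy, gx = np.gradient(np.asarray(image, dtype=float))  # d/drow, d/dcolumn
    angle = np.arctan2(gy, gx)
    h, w = angle.shape
    a = angle[: h - oy, : w - ox]            # angles at (x, y)
    b = angle[oy:, ox:]                      # angles at (x + ox, y + oy)
    diff = np.angle(np.exp(1j * (a - b)))    # wrap the difference into (-pi, pi]
    pdf, edges = np.histogram(diff, bins=bins, range=(-np.pi, np.pi), density=True)
    return pdf, edges

# Toy image (hypothetical): a horizontal ramp, so all gradients point the same way.
ramp = np.tile(np.arange(32.0), (32, 1))
pdf, edges = angle_difference_pdf(ramp, offset=(2, 0))
```

With an odd number of bins the central bin is centred on a zero angle difference, which makes a peak at 0 easy to read off.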

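With the compatibility factors so designed, the prototype compatibility function,
written in terms of the factor $r_{loc}$ of (11), is

$$r(e_i, e_j) = |e_j|\, r_{loc}(d_i, \alpha_i \mid e_j) \cos(\angle e_i), \qquad (12)$$

and for the general case (i.e. the non-prototype condition),

$$r(e_i, e_j) = |e_j|\, r_{loc}(d_{ij}, \alpha_{ij} \mid e_j) \cos(\angle e_i - \angle e_j), \qquad (13)$$

where $d_{ij}$ is the distance between $(i_x, i_y)$ and $(j_x, j_y)$, and $\alpha_{ij}$ is the slope of the
line connecting $(i_x, i_y)$ and $(j_x, j_y)$. See Figure 1(a).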

The use of this compatibility factor results in a computationally efficient 
procedure for maximizing the total support defined in (5). The next section 
discusses the maximization process. 

3 Relaxation Procedure 

3.1 Maximizing Support Function 

Computational effort is a major consideration in maximizing the total support
function. To show this, denote the number of possible edge locations by $N_l$
and the number of possible edge orientations by $N_o$. For simplicity, if we allow
multiple edges to share the same site, the number of possible configurations for
each edge is $N_l N_o$, which can be quite large even for moderate cases. For example,
with $N_l = 5 \times 5$ and $N_o = 16$, $N_l N_o = 400$. Then, in order to find $M(i)$, which
maximizes $S_i$ in (3), a brute force method requires $N_l N_o$ evaluations of $S_i$. Thus,
the total number of evaluations of the compatibility function is $N_n N_l N_o$, where
$N_n$ is the number of neighbor sites contributing to the sum in (3).

Fig. 2. Correlation of Edge Orientations. The plot shows the PDF of
$\angle\nabla I(x, y) - \angle\nabla I(x + o_x, y + o_y)$ at different offsets $(o_x, o_y)$. Both PDFs show a
strong peak at $\angle\nabla I(x, y) - \angle\nabla I(x + o_x, y + o_y) = 0$. (a) $(o_x, o_y) = (2, 0)$.
(b) $(o_x, o_y) = (2, 2)$.



Maximizing $r_{or}$. The main reason for using the cosine function in $r_{or}$ to
interpolate between the two extrema is to reduce the computational burden and, at
the same time, increase the resolution of edge orientation. Assume that we are
maximizing $S_i$ in terms of $\angle e_i$. Then, by using the trigonometric identities
$\cos(a + b) = \cos(a)\cos(b) - \sin(a)\sin(b)$ and
$A\sin(\theta) + B\cos(\theta) = \sqrt{A^2 + B^2}\cos(\theta - \phi)$, $\phi = \arctan(A/B)$, the support
function can be expressed as

$$S_i(\angle e_i) = \sum_j |e_j|\, r_{loc} \cos(\angle e_i - \angle e_j) = F_i \cos(\angle e_i - \Theta_i) \qquad (14)$$

where

$$F_i = \sqrt{\Big(\sum_j |e_j|\, r_{loc} \cos(\angle e_j)\Big)^2 + \Big(\sum_j |e_j|\, r_{loc} \sin(\angle e_j)\Big)^2},$$

$$\Theta_i = \arctan\left(\frac{\sum_j |e_j|\, r_{loc} \sin(\angle e_j)}{\sum_j |e_j|\, r_{loc} \cos(\angle e_j)}\right).$$

Therefore $\angle e_i = \Theta_i$ maximizes $S_i$, and $\Theta_i$ can be computed with $N_n$
evaluations of $r_{loc}$ instead of $N_o N_n$. Also, the domain of $\angle e_i$ becomes continuous
without any computational penalty.
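The closed-form maximization can be checked numerically against a brute-force scan over discrete orientations. The following is a minimal sketch under stated assumptions: the function names and the neighbor values are hypothetical, the values of the location factor are supplied as plain numbers, and `arctan2` is used in place of `arctan` so that the quadrant of the angle is resolved.

```python
import numpy as np

def support_vector_sum(strength, r_loc, angle):
    """Closed-form maximizer of the support: returns (F, Theta)."""
    w = np.asarray(strength, dtype=float) * np.asarray(r_loc, dtype=float)
    angle = np.asarray(angle, dtype=float)
    fx = np.sum(w * np.cos(angle))
    fy = np.sum(w * np.sin(angle))
    # arctan2 instead of arctan: resolves the quadrant (an implementation choice).
    return float(np.hypot(fx, fy)), float(np.arctan2(fy, fx))

def support(theta, strength, r_loc, angle):
    """Direct evaluation of S(theta) = sum_j |e_j| r_loc_j cos(theta - angle_j)."""
    w = np.asarray(strength, dtype=float) * np.asarray(r_loc, dtype=float)
    return float(np.sum(w * np.cos(theta - np.asarray(angle, dtype=float))))

# Hypothetical neighbor data: edge strengths, location factors, orientations.
strength = [1.0, 0.5, 2.0]
r_loc = [0.9, 0.4, 0.7]
angle = [0.1, 1.2, -0.3]

F, Theta = support_vector_sum(strength, r_loc, angle)
# Brute force over 16 discrete orientations, as in the N_o = 16 example.
brute = max(support(t, strength, r_loc, angle)
            for t in np.linspace(-np.pi, np.pi, 16, endpoint=False))
```

The brute-force maximum can never exceed `F`, and it matches `F` only when one of the sampled orientations happens to coincide with `Theta`.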

Another, more visually intuitive, interpretation of the above formula is to
consider a vector $\mathbf{f}_{ij}$ whose length and angle are $|e_j|\, r_{loc}$ and $\angle e_j$,
respectively, and a vector $\mathbf{F}_i$ whose length and angle are $F_i$ and $\Theta_i$,
respectively. Then $\mathbf{F}_i$ can be computed as the vector sum of the $\mathbf{f}_{ij}$. Thus,

$$\mathbf{F}_i = \sum_j \mathbf{f}_{ij}, \qquad (15)$$

and the $\angle e_i$ that maximizes $S_i$ is $\angle \mathbf{F}_i$.

Obtain an initial edge configuration with some gradient operator
do {
    for (each edge element i) {
        Compute F_i at the current location and its neighbor sites.
        Move the edge to the location where F_i is the largest.
        Set the edge orientation to Theta_i at the new location.
    }
} until convergence

Fig. 3. Procedure for Maximizing the Total Support.



3.2 Procedure 

We propose a local and iterative procedure for obtaining the edge configuration 
that maximizes S. Because of its local nature, the procedure is not guaranteed to 
find the global maximum. However, it is computationally efficient, intrinsically 
parallel and effective in finding a near-optimum solution. 

The procedure updates each edge element sequentially. For each $e_i$, $F_i$ is
computed at three different locations: the current edge location and two locations
that are $\epsilon$ apart from $e_i$ in the direction perpendicular to $\angle e_i$ (Figure 1(b)).
The main reasons for this constrained search are the following. If the estimated
edge orientation is correct, the shortest path for the edge to reach the contour is
along the direction perpendicular to the edge orientation. Thus the maximization
process will find the accurate contour location more quickly by moving the edge
along the search direction. The search strategy also helps to maintain uniform
spacing between edges on the same contour and prevents them from being attracted
to those with high edge strength and collapsing into a single point.

Figure 3 gives pseudocode for this procedure. We used Nitzberg-Shiota’s
gradient operator [17] to obtain an initial edge configuration. For each edge
element, the procedure requires 3 evaluations of $S_i$, or equivalently $F_i$. This is a
significant reduction from $N_o N_l$.
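One update of this constrained search might look as follows in Python. This is only a sketch under loudly stated assumptions: the exact form of the location factor is not reproduced here, so `r_loc` below uses an assumed Gaussian-times-exponential shape with illustrative parameter values; all other edges are treated as neighbors; and the fixed iteration count in `relax` stands in for the convergence test of Figure 3. All function names are hypothetical.

```python
import numpy as np

def r_loc(d, alpha, theta_j, sigma=0.5, beta=1.0):
    # ASSUMED form, for illustration only: Gaussian in the distance d and
    # exponential in sin^2 of the angle between the connecting line and the
    # neighbor's orientation.
    return np.exp(-d**2 / (2.0 * sigma**2) - beta * np.sin(alpha - theta_j)**2)

def support_at(pos, neighbors, sigma=0.5, beta=1.0):
    """Return (F, Theta) at pos; neighbors has rows (x, y, strength, angle)."""
    dx = neighbors[:, 0] - pos[0]
    dy = neighbors[:, 1] - pos[1]
    w = neighbors[:, 2] * r_loc(np.hypot(dx, dy), np.arctan2(dy, dx),
                                neighbors[:, 3], sigma, beta)
    fx = np.sum(w * np.cos(neighbors[:, 3]))
    fy = np.sum(w * np.sin(neighbors[:, 3]))
    return np.hypot(fx, fy), np.arctan2(fy, fx)

def relax_edge(edges, i, eps=0.25, sigma=0.5, beta=1.0):
    """Update edge i in place: evaluate F at the current site and at the two
    sites eps away along the normal to the edge orientation; keep the best."""
    x, y, _, theta = edges[i]
    normal = np.array([-np.sin(theta), np.cos(theta)])
    neighbors = np.delete(edges, i, axis=0)
    candidates = [np.array([x, y]) + s * eps * normal for s in (0.0, -1.0, 1.0)]
    scores = [support_at(c, neighbors, sigma, beta) for c in candidates]
    best = int(np.argmax([f for f, _ in scores]))
    edges[i, 0], edges[i, 1] = candidates[best]
    edges[i, 3] = scores[best][1]

def relax(edges, iterations=10, **kwargs):
    """Sweep over all edges a fixed number of times (simplified stopping rule)."""
    for _ in range(iterations):
        for i in range(len(edges)):
            relax_edge(edges, i, **kwargs)
    return edges

# Hypothetical data: five collinear edges, the middle one 0.25 off the line.
edges = np.array([[0, 0, 1, 0], [1, 0, 1, 0], [2, 0.25, 1, 0],
                  [3, 0, 1, 0], [4, 0, 1, 0]], dtype=float)
relax_edge(edges, 2)
```

In this toy configuration the displaced edge is pulled back onto the line through its neighbors, which is the behavior the constrained search is meant to produce.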

The edge configuration resulting from this maximization procedure is impor- 
tant for determining object contours in noisy images. Our final goal is to use the 
resulting edge configuration to obtain high-resolution edge contours. 

4 Contour Extraction 

It is very useful in many vision applications to extract contours at sub-pixel 
accuracy. For example, an effective sub-pixel contour extraction process can aid 
data analysis of low-resolution data, improve visual quality of image expansion, 
and increase spatial accuracy of matching algorithms. 

Using the contour fragments from the edge localization process, we create a
boundary contour that is continuous at a high resolution. Furthermore, we want
to form the contour in such a way that the grouping result is compatible with
our visual perception and the process is computationally efficient. Many edge
grouping techniques have been developed so far. For both computational and
performance reasons, we found it beneficial to develop another one that is
tailored to the particular information available to us. Several steps are
necessary to obtain the result.

First, localized edges are resampled on the new, finer lattice. Assuming that
we are interested in extracting contours at a resolution $p$ times higher in both
horizontal and vertical directions than the original data, the resulting contour
image is $p \times p$ times larger than the original. An edge, $e(x, y)$, is placed at
$(\mathrm{round}(xp), \mathrm{round}(yp))$ of the new lattice, where $\mathrm{round}(x)$ returns the integer
closest to $x$. When multiple edges reside on the same lattice site, only the edge
with the largest $|e|$ is kept.
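The resampling step can be sketched in a few lines of Python. The function name, the tuple layout of an edge, and the tie-breaking rule for rounding are assumptions made for illustration.

```python
import math

def resample_edges(edges, p):
    """Place edges (x, y, strength, angle) on a lattice p times finer.
    When several edges land on one site, only the strongest is kept."""
    lattice = {}
    for x, y, strength, angle in edges:
        # floor(v + 0.5) rounds to the nearest integer with ties going up
        # (an assumption: the text does not say how ties are broken).
        site = (math.floor(x * p + 0.5), math.floor(y * p + 0.5))
        if site not in lattice or strength > lattice[site][0]:
            lattice[site] = (strength, angle)
    return lattice

# Hypothetical sub-pixel edges; the first two fall on the same fine-lattice site.
edges = [(1.05, 2.0, 0.5, 0.0), (1.1, 2.05, 0.9, 0.3), (3.0, 1.5, 0.2, 1.0)]
fine = resample_edges(edges, p=4)
```

A dictionary keyed by lattice site keeps the sketch sparse, which suits the case where only a small fraction of sites carry an edge.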

Second, edges are grouped into a contour based on proximity and contin- 
uation. Since the localization and resampling processes effectively reduce both 
ambiguity of edge location and the number of spurious edges, a simple perceptual 
organization technique works well for this task. Also, since edges are distributed 
very evenly along the contour after the localization process, the search can be 
restricted to within a small neighborhood. We found that a search distance as 
small as p is often enough for our purpose. 

For each edge, our grouping procedure searches in its neighborhood for two
edges based on some proximity and continuation criteria. For the first pair,
we choose, heuristically and empirically, the following quantity to measure the
continuation of two edges. Denoting $e_i$ as the current edge for the grouping
process and $e_j$ as a neighbor edge against which the grouping criterion is being
evaluated, the continuation measure, $\rho_{ij}$, is

$$\rho_{ij} = \cos^2(\alpha_{ij} - \angle e_i)\cos^2(\alpha_{ij} - \angle e_j). \qquad (16)$$

The definition of $\alpha_{ij}$ is the same as before. Thus, the continuation measure
ranges between 0 and 1. It is 1 when the orientation of each edge is either the
same as $\alpha_{ij}$ or differs from it by $\pi$. It is 0 when one of the edges is
perpendicular to $\alpha_{ij}$. The measure varies smoothly between the extrema.

For the second pair, we also take the smoothness of the contour formed by
the first pair and $e_j$ into consideration. Then the continuation measure for the
second pair, $\rho'_{ij}$, is the product of the smoothness measure and $\rho_{ij}$:

$$\rho'_{ij} = \sin^2(0.5\,(\alpha_{il} - \alpha_{ij}))\,\rho_{ij} \qquad (17)$$

where $\alpha_{il}$ is the slope of the line connecting the first pair. Again, the measure
ranges between 0 and 1. It is 1 when the three edges are collinear and point in the
direction of the line connecting them. It is 0 when either $\rho_{ij} = 0$ or
$\alpha_{il} = \alpha_{ij}$.
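The two continuation measures are simple enough to evaluate directly. The sketch below reads the exponents in (16) and (17) as squares, which is an assumption consistent with the stated range of 0 to 1; the function names and the test angles are hypothetical.

```python
import math

def continuation(alpha_ij, angle_i, angle_j):
    """Continuation measure of (16), with the exponents read as squares."""
    return math.cos(alpha_ij - angle_i) ** 2 * math.cos(alpha_ij - angle_j) ** 2

def continuation_second(alpha_first, alpha_ij, angle_i, angle_j):
    """Measure of (17): smoothness factor times the continuation measure.
    alpha_first is the slope of the line connecting the first pair."""
    smoothness = math.sin(0.5 * (alpha_first - alpha_ij)) ** 2
    return smoothness * continuation(alpha_ij, angle_i, angle_j)

aligned = continuation(0.0, 0.0, math.pi)             # both edges along the line
perpendicular = continuation(0.0, math.pi / 2, 0.0)   # one edge across the line
straight = continuation_second(math.pi, 0.0, 0.0, 0.0)  # partners on opposite sides
folded = continuation_second(0.0, 0.0, 0.0, 0.0)        # partners on the same side
```

The smoothness factor is what prevents the second partner from being chosen on the same side as the first one.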

The proximity measure is incorporated into the order of the search. We first
start the search in the neighborhood whose chessboard distance is 1 from the
current lattice site (i.e. $\max(|i_x - j_x|, |i_y - j_y|) = 1$). If the maximum of the
continuation measure in this neighborhood is above some threshold $\zeta$, then the
site associated with the maximum is selected and the search stops. When no
site has a continuation measure above $\zeta$, the search continues in the
neighborhood at distance 2, then 3, and so on until the distance exceeds a
predefined maximum, $D$. The advantages of this strategy are, first, that the number
of searches needed to find a match is smaller than with a fixed search area and,
second, that the formulation of a ’goodness’ measure is simplified, as only the
continuation measure needs to be considered. The disadvantage is that the search
can miss the ’best’ match when an acceptable match is detected at a smaller
distance.
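The ring-by-ring search can be sketched as below. The function and argument names are hypothetical, the set of occupied sites and the scores are toy data, and the continuation measure is passed in as a callable rather than computed from edge angles.

```python
def ring_search(site, occupied, measure, zeta, max_distance):
    """Search rings of increasing chessboard distance around `site` and return
    the best site of the first ring whose maximum measure exceeds zeta."""
    ix, iy = site
    for dist in range(1, max_distance + 1):
        best_value, best_site = zeta, None
        for jx in range(ix - dist, ix + dist + 1):
            for jy in range(iy - dist, iy + dist + 1):
                on_ring = max(abs(ix - jx), abs(iy - jy)) == dist
                if on_ring and (jx, jy) in occupied:
                    value = measure(site, (jx, jy))
                    if value > best_value:
                        best_value, best_site = value, (jx, jy)
        if best_site is not None:
            return best_site          # stop at the first ring with a match
    return None                       # nothing above zeta within max_distance

# Toy data: a mediocre candidate at distance 1 and a better one at distance 2.
occupied = {(1, 1), (2, 0)}
scores = {(1, 1): 0.6, (2, 0): 0.9}
measure = lambda i, j: scores[j]

near = ring_search((0, 0), occupied, measure, zeta=0.5, max_distance=4)
far = ring_search((0, 0), occupied, measure, zeta=0.7, max_distance=4)
nothing = ring_search((0, 0), occupied, measure, zeta=0.95, max_distance=4)
```

The first call illustrates the stated disadvantage: the acceptable match at distance 1 is returned although a better one exists at distance 2.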

Note that the grouping process is not symmetric, i.e. $e_i$ selecting $e_j$ as a
grouping pair does not guarantee $e_j$ selecting $e_i$. This asymmetric property is
used to form T-junctions.

The next step is to interpolate a pair of edges to form a contour. Typically, 
the distance between a pair of grouped edges is small, and we found a simple 
first-order polynomial interpolation is visually acceptable for 4x4 expansion 
used in our experiments. For higher expansion rates, higher order polynomials 
may be required. One alternative is to use an Essentially Non-Oscillatory inter- 
polation scheme for better preservation of corners and junctions [24]. Another 
possibility is to use the field so that the curve traces the ridge of the field. 
These alternatives will give smoother interpolation but are more computation- 
ally demanding. 

For every grouped pair of edges, an 8-digital straight segment is drawn, and the
lattice sites on the segment, including both the starting and ending edge sites,
are marked. Then a contour is defined as an 8-connected component of marked sites,
and the contour length is defined as the number of sites contributing to the
contour. For details of digital straight segments and how to draw them, see [15].
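As an illustration of drawing an 8-connected segment between two grouped edge sites, the sketch below uses the classic Bresenham line algorithm. This is a swapped-in, well-known technique chosen for brevity; it is not claimed to be the construction described in [15], and the function name is hypothetical.

```python
def digital_segment(p0, p1):
    """8-connected lattice sites from p0 to p1 (both included), Bresenham style."""
    x0, y0 = p0
    x1, y1 = p1
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    sites = []
    while True:
        sites.append((x0, y0))
        if (x0, y0) == (x1, y1):
            break
        e2 = 2 * err
        if e2 >= dy:          # step in x
            err += dy
            x0 += sx
        if e2 <= dx:          # step in y
            err += dx
            y0 += sy
    return sites

diagonal = digital_segment((0, 0), (3, 3))
shallow = digital_segment((0, 0), (5, 2))
marked = set(diagonal) | set(shallow)   # sites marked as contour candidates
```

Collecting the marked sites in a set makes it straightforward to run a connected-component pass afterwards.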

Now $F_i$ is computed at every contour point. When $F_i$ is below some threshold
$\eta$, the point is removed from the contour. After the thresholding, the procedure
finally removes contours whose lengths are smaller than some threshold $L$.

5 Experimental Results 

5.1 Edge Localization 

At low resolution, our maximization procedure effectively refines edge locations
while simultaneously removing spurious edges. Figure 4 shows synthetic test
images. One is without noise and the other has additive Gaussian noise. The
signal-to-noise ratio of the noisy image is 1.5. The image size is 64 x 64 pixels.
Throughout the experiments, the following set of parameters is used.

/I = 0,CT = 0.5, e = 0.25. 

Figure 5 shows the initial edge configuration of each test image. Edges are thick
mainly due to the 3 x 3 mask used in the Nitzberg-Shiota operator. Thus there is a
3-pixel ambiguity in edge location even for the clean image.

Figure 6 shows the results of the maximization procedure. It is evident from
the result for the clean image that the procedure effectively resolved the
ambiguity of edge location and provided more accurate locations of the edges. For
the noisy image, the procedure combined random noise edges and produced some
additional patterns. However, due to the random nature of these edges, the
patterns are shorter than those formed by actual contour edges. They also tend to
contain edges whose $S_i$ is small because of high curvature at their locations. By
removing edges with $S_i$ smaller than some threshold value, random patterns are
broken into even smaller pieces, and the true patterns and random patterns can be
separated effectively based on the contour length.

Fig. 4. Synthetic Test Image. The actual size of the images is 64x64 pixels. They are
expanded by pixel duplication for viewing.

Fig. 5. Initial Edge Configuration. The figure shows the initial edge configuration of
the test images shown in Figure 4.



5.2 Contour Extraction 

The results of the contour extraction process are shown in Figure 7. Parameter
values are given in the figure caption. For the clean image, the complete contour
boundaries are extracted with high localization accuracy. For the noisy image,
noise edges are effectively removed while most of the actual boundaries are
extracted. For the clean image, a larger $D$ is used to connect edges at junctions.

This contour extraction process is also applied to natural images. Results are
given on the right in Figure 8. The process extracted subtle features without
being affected by random noise. For example, with the house image, our procedure
delineated the outline of the roof more completely than the other edge extraction
techniques we tested. With the seagull image, our procedure extracted the pattern
of the feathers without picking up relatively strong random patterns in the
background.

Fig. 6. Localized Edge Configuration. The figure shows the edge configuration of the
test images shown in Figure 4 after 10 iterations.

5.3 Comparison 

For comparison, Canny’s edge detector [2] is applied to the synthetic images.
The result is shown in Figure 9. The detector uses Gaussian smoothing followed
by a gradient operator to detect edges. Such a simple linear operator fails to
distinguish true boundaries from random noise. As the amount of smoothing is
increased, the number of spurious edges decreases, but at the same time the real
surface boundaries are removed as well.

Overall, the whole process of sub-pixel contour extraction on a 64x64 image
expanded to 256x256 with 20 relaxation iterations took 25 seconds on a 300 MHz
SGI O2. Note that, for ease of maintenance, our code is not optimized (we
implemented it in C++ using the STL vector, which incurs large overheads in both
speed and memory usage), and we believe that a significant improvement in speed
can be achieved. The most computationally intensive part of the process is the
relaxation, which consumed 90% of the computation time.

6 Discussion 

As stated in the introduction, the aim of this research is to design a reliable
contour extraction system built on simple, localized, and possibly iterative
processing. The motivation is to derive a neurologically feasible system from a
practical perspective. In the computational vision community, much research on
the contour integration problem has approached it through faithful modeling of V1
and V2 neurons with both excitatory and inhibitory connections in a recurrent
fashion [14]. It is often cumbersome, however, to build a system purely in
a bottom-up fashion. Our study attempts to bridge the two extremes of purely
neurological and purely engineering approaches and to give insight into the
development of a neuro-morphological contour extraction system that is more
faithful to the neurological structures.

Toshiro Kubota, Terry Huntsberger, and Jeffrey T. Martin

Fig. 7. Result of Contour Detection. Contour detection is applied on 64x64 synthetic
images at the 256x256 resolution. The following set of parameters is used. Left:
$\zeta = 0.5$, $D = 10$, $\eta = 1.0$, $L = 75$. Right: $\zeta = 0.5$, $D = 4$, $\eta = 1.0$, $L = 75$.

In this respect, we found that the probabilistic relaxation is a suitable vehicle 
to describe a mathematically sound and numerically stable design of a recurrent 
neural network. With this view and with neural network terminology, the com- 
patibility function provides a weight between two neurons and the projection to 
the PDF space represents a subsequent non-linear transform. The formulation 
based on the probabilistic relaxation allows us to concentrate on the algorithmic 
level of development rather than the network level, which can be laborious. 

Another aspect of our development that ties in closely with neurophysiological
evidence is that the compatibility function derived in Section 2.1 resembles
long-range interconnection patterns found in the V1 areas. The long-range
interactions are considered to facilitate the contour integration process. The
association field by Field et al. [4] and the oscillatory intracortical network
by Li [14] are models of these neural structures for the contour integration
process. Although our formulation is derived empirically and heuristically and is
not grounded in any physiological data, it can be replaced with other models
without any further changes in its procedure. Our model has only two free
parameters; thus it is simpler to implement and adjust than the above models.

Our technique differs from other field-based contour integration techniques,
in particular the works by Elder and Zucker [3], Guy and Medioni [5], Shashua and
Ullman [23], and Williams [26]. These techniques use the field representation to
measure the saliency of features, while ours actively reconfigures the edges by
the probabilistic relaxation process. Our active reconfiguration process tends to
increase the saliency of structured patterns more than that of random ones,
resulting in a clearer separation of the two types of patterns. The mechanism of
the reconfiguration process is similar to the recurrent excitation and inhibition
of the V1 cells and has effects on contour enhancement and texture suppression
[12]. Our success in separating patterns reinforces Lee’s proposed mechanism.

Fig. 8. Result of Contour Detection on Natural Images. Contours are detected at a
resolution 4x4 times higher than the original. The left column shows the original
images expanded by 4x4 using pixel duplication. The right column shows the
corresponding results. Parameters used are $\zeta = 0.1$, $D = 4$, $\eta = 0.5$, $L = 30$ for
the house and $\zeta = 0.1$, $D = 4$, $\eta = 0.5$, $L = 50$ for the seagull.

Research on detecting edges at sub-pixel accuracy under noisy conditions dates
back to Hueckel’s work [7]. Typically, edge locations are estimated in the
continuous domain based on theoretical modeling of the image formation and edge
detection processes. Noise is often handled either by explicit thresholding on
the strength of edges or by implicit smoothing through interpolation functions.
Similarly, our technique employs image formation and edge models that are
captured in the compatibility function. However, instead of applying thresholding
or smoothing, it delays handling noise until edges have been reconfigured by the
relaxation process. The process is effective in isolating noise edges from actual
object boundaries, as seen in Figure 6, and it then becomes easier to separate
them by simple heuristic rules, as demonstrated in Figure 7.

It is not clear at this point whether a localization process similar to the one
described here takes place in the visual cortex. However, as suggested by visual
hyperacuity, some sub-pixel measure of spatial offsets is a part of cortical
processing. More study in both neurological and computational science is
necessary to reach a conclusion.

Fig. 9. Result of Canny Edge Detection. Canny edge detection is applied on the
synthetic images of Figure 4 at the 256x256 resolution. The original 64x64 images
are expanded using bi-linear interpolation prior to the edge detection operation.

7 Conclusion 

The paper described a feature extraction process using unambiguous probabilistic
relaxation to conduct local pattern analysis of an image. Through local
iterative interactions controlled by the relaxation process, globally consistent
patterns emerge at sub-pixel accuracy, while noise is suppressed to form less
consistent patterns. By post-processing the patterns based on the consistency
measure derived from the support function, features can be separated from noise.
Our formulation of the compatibility function allows an efficient relaxation
process requiring only 3 evaluations of the support function for each edge per
iteration.

We developed a contour extraction procedure on a super-resolution lattice
based on the relaxation result and a perceptual organization technique. The
effectiveness of the procedure is demonstrated on both synthetic and natural
images. Comparison with the Canny operator shows the superior performance of our
technique in terms of noise robustness and localization accuracy.

Abstract. We propose two variational models for supervised classification
of multispectral data. Both models take into account contour and region
information by minimizing a functional composed of a data term (2D surface
integral), which takes into account the observation data and knowledge of
the classes, and a regularization term (1D length integral), which minimizes
the length of the interfaces between regions. This is a free discontinuity
problem, and we have proposed two different ways to reach such a minimum,
one using a Γ-convergence approach and the other using a level set approach
to model contours and regions.

Both methods have previously been developed for monospectral observations.
Multispectral techniques make it possible to take into account information
from several spectral bands of satellite or aerial sensors. The goal of this
paper is to present the extension of both variational classification methods
to multispectral data. We show an application to real data from the SPOT
satellite (XS mode) for which we have a ground truth. Our results are also
compared to results obtained with a hierarchical stochastic model.

Keywords: classification, multispectral images, Γ-convergence, level-set
methods, active regions, active contours.



1 Introduction 

Variational approaches and Partial Differential Equation (PDE) models have
been shown to be efficient for a wide variety of image processing problems such
as restoration and edge detection [1,10,20,21,22], or shape segmentation with
active contours [8,16,19]. Nevertheless, the notion of classification, which
consists of assigning a label to each site of an image to produce a partition of
the image into homogeneous labelled areas, has rarely been introduced in a
variational formulation (continuous models), mainly because the notion of class
has a discrete nature. The classification problem concerns many applications,
for instance land use management in remote sensing. Many classification models
can be found in stochastic approaches (discrete models), with the use of Markov
Random Field (MRF) theory, as for instance in [4,7,12,17]. Structural approaches
such as splitting, merging and region growing models [23], and a few other models
such as a combination of statistical and deterministic techniques [31,35], have
also been developed for image classification. But, to our knowledge, very little
research has been conducted on classification with variational models.
Our goal is to build a functional whose minimum defines a regular classification
of the observation, the classes being known (supervised classification). The
regularity is obtained by minimizing the length of the discontinuities of the
solution, i.e. the length of the region boundaries of the classified image. The
classification problem is then a free discontinuity problem in the sense that the
main difficulty is to capture information on regions (2D data term) and their
contours (1D discontinuities: regularization term). To reach that goal, two
methods have been developed for monospectral data [27,26]. One is based on
Γ-convergence results, first derived for fluid mechanics problems. The second one
uses a level set approach to model the regions and contours of each class.

In real applications, multiband data, such as SPOT data in XS mode composed of
three images of the same scene at different wavelengths, are usually used for
classification purposes. In this work, we propose an extension of both models to
multiband data, taking into account coupled information from these bands.

After setting the classification hypotheses and some notation, we state the
properties of the solution we are looking for, and formalize the classification
problem through the minimization of a functional. Depending on the variable we
use to model a classified image, we obtain two different functionals. Since the
minimum of both functionals is impossible to reach directly (by a gradient
descent, for example), we propose, as in the monospectral case, approximate
functionals to reach the minima. Results are finally given on satellite images
and compared with those obtained by using a hierarchical stochastic model
recently proposed in [11].

2 Problem Statement 

2.1 Hypothesis and Notation 

The considered problem is based on partitioning the image into different areas
(i.e. different classes), each area being characterized by a feature. The feature
criterion we are interested in is the spatial distribution of the intensity. Of
course, discriminant features other than intensity can be used by considering
suitable parameters (texture parameters, for example). Within this framework, a
class is characterized by the parameters of the spatial distribution of
intensity, i.e. the mean and standard deviation under the Gaussian hypothesis
(covariance matrix in the multispectral case). This work takes place in the
general framework of supervised classification, which means that the number and
the parameters of the classes are known. These values are either given by an
expert or pre-computed by a fuzzy C-means algorithm with an entropy term (see
[18] for instance).

General notation: $\Omega$ is an open bounded subset of $\mathbb{R}^2$. Observed data are
represented by the function $I : \Omega \to \mathbb{R}$ for the monospectral case and
$I : \Omega \to \mathbb{R}^P$, that is $I = [I^1, \ldots, I^P]$, for the multispectral case. $P$ is
the number of bands.

We assume I G L^(f2,K^). 

Classification notation: $K$ is the number of classes. We assume that the number
$K$ is the same in each band, each band representing the same scene in a different
spectral domain. The $K$ classes are characterized by Gaussian parameters as
follows:

$\mu_i = \{\mu_i^p\}_{p=1..P}$ and $\Sigma_i$, for $i = 1, \ldots, K$.

$\mu_i$ is the mean vector of class $i$, containing the mean for each band $p$, and
$\Sigma_i$ is the $P \times P$ covariance matrix defined by
$\Sigma_i^{mn} = \mathrm{cov}(I^m(x), I^n(x))$, $x$ in class $i$. $\Sigma_i^{pp}$ is the variance associated
with class $i$ for band $p$. This matrix represents a measure of the correlation
between bands. A diagonal matrix $\Sigma_i$ means that, for class $i$, information in
the different bands is totally decorrelated.
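When the class parameters are not supplied by an expert, they can be estimated from labelled training pixels. The sketch below is only one simple way to do this and is an assumption, not the procedure of the paper (which mentions an expert or a fuzzy C-means algorithm): it takes the sample mean and the unbiased sample covariance of the pixel vectors of one class. The function name and the toy samples are hypothetical.

```python
import numpy as np

def class_parameters(samples):
    """Estimate the mean vector and P x P covariance matrix of one class.
    samples: array of shape (N, P), one multispectral pixel vector per row."""
    samples = np.asarray(samples, dtype=float)
    mu = samples.mean(axis=0)
    sigma = np.cov(samples, rowvar=False)   # unbiased estimate (divides by N - 1)
    return mu, sigma

# Toy training pixels for one class in P = 2 bands (hypothetical values).
samples = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
mu, sigma = class_parameters(samples)
```

In this toy example the two bands vary together, so the off-diagonal entries of the covariance are as large as the variances, the opposite of the decorrelated (diagonal) case.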



2.2 Variational Problem Formulation 



As we assume that the observed intensity has a Gaussian distribution within
each class, we have chosen to represent each class by its mean (variances will
be taken into account later in the algorithms). So the label of a class will be
its mean, in such a way that the values of the classes constitute an
approximation of the observed data $I$ (in the norm sense). Therefore our goal
is to find an image $f$ such that $f(x) = \sum_{i=1}^{K} \chi_{\Omega_i}(x)\,\mu_i$, where $\Omega_i$ is the set of
pixels in class $i$ and where the family $\{\Omega_i\}_{i=1..K}$ forms a partition of $\Omega$.
$f$ must be close to the observation $I$ in the norm sense. A third condition is
added in order to get a regular solution: $f$ should have minimal length
discontinuity curves, or equivalently $\{\Omega_i\}_{i=1..K}$ should have minimal length
interfaces.

In order to define a solution in a variational approach, we define a functional
whose minimum has the desired properties. The unknown solution can be
modeled either by using a function $f : \Omega \to \mathbb{R}^P$ such that at the infimum
$f$ takes its values in the set $\{\mu_i\}_{i=1..K}$, or by using sets $\{\Omega_i\}_{i=1..K}$ such that
$\Omega_i = \{x \in \Omega,\ x \text{ is in class } i\}$ and $\{\Omega_i\}_{i=1..K}$ form a partition of $\Omega$. In the
first approach, the functional is defined as

$$J(f) = \underbrace{\int_\Omega |f(x) - I(x)|^2_{\mathbb{R}^P}\,dx}_{\text{data term}} + \underbrace{\mathcal{H}^1(S_f)}_{\text{regularization term}} \quad \text{and } f(x) \in \{\mu_1, \ldots, \mu_K\},\ \forall x \in \Omega, \qquad (1)$$

$|\cdot|_{\mathbb{R}^P}$ being a norm in $\mathbb{R}^P$. $\mathcal{H}^1$ denotes the 1D Hausdorff measure and $S_f$ is the
set of jumps of $f$ (which coincides with the set of discontinuities of $f$ up to an
$\mathcal{H}^1$-negligible set), see [13].

In the second approach, the functional is defined as 

K 

= V/ [I - - ^li]dx + 

^=1 ^ 

data term regularization term 

and {^i}i=i..K is a partition of Q (2) 

where $\Gamma_i = \partial\Omega_i \cap \Omega$. $|\Gamma_i| = \mathcal{H}^1(\Gamma_i)$ is the one-dimensional Hausdorff
measure of the set $\Gamma_i$.

Despite the fact that the covariance matrix $\Sigma_i$ is not taken into account in
the expression of $J(f)$ in (1) (it will be introduced later), both functionals,
$J(f)$ in (1) and $J(\Omega_1, \ldots, \Omega_K)$ in (2), define the same constraints on the solution.
Minimizing (1) w.r.t. $f$ or (2) w.r.t. $\Omega_i$ is a difficult task, since the
functionals involve terms of different natures (1D versus 2D terms) and the
unknowns in (2) are sets and not functions. For each case, and for monospectral
data, we have developed a method that overcomes this difficulty (see [27] and
[26]). We present here the methods for multispectral data.




3 Classification with Restoration 



Computing a minimum of the functional (1) is difficult due to the presence
of the 1D regularization term, which acts on the discontinuity set of the unknown
variable $f$. Based on mathematical results on Γ-convergence in fluid mechanics,
we have proposed in [27] to minimize a sequence of functionals. The extension
to multispectral data gives the following sequence of functionals

$$J_\varepsilon(f) = \underbrace{\int_\Omega |f(x) - I(x)|^2_{\mathbb{R}^P}\,dx}_{\text{data term}} + \underbrace{\varepsilon\lambda^2 \int_\Omega \varphi(|\nabla f(x)|_{\mathbb{R}^P})\,dx}_{\text{restoration term}} + \underbrace{\frac{\eta^2}{\varepsilon} \int_\Omega W(f(x))\,dx}_{\text{classification}}, \qquad (3)$$

and the associated problem consists in finding $f_0$ such that

$$f_0 = \lim_{\varepsilon \to 0} \arg\min_f J_\varepsilon(f). \qquad (4)$$

The parameters $\lambda$ and $\eta$ are fixed; $\varepsilon$ varies. Let us first consider the
functional with a fixed $\varepsilon$. The first two terms of (3) are standard for noisy
image restoration by anisotropic regularization [10,21]. The function $\varphi$ is a
smoothing function that will be defined later.

The third term of (3) is a level constraint such that $W : \mathbb{R}^P \to \mathbb{R}^+$ attracts
the values of $f(x)$ towards the means $\mu_i$ of the classes $i$, taking into account
the covariance matrices $\Sigma_i$. $W$ has $K$ minima at the values $\mu_i$, such that
$W(\mu_i) = 0$, $\forall i$. $W$ is quadratic in $\mathbb{R}^P$ around each minimum (from the Gaussian
distribution hypothesis), as illustrated in Figure 1. $W$ is defined by

$$W(f(x)) = \min_{i=1,\ldots,K} [f(x) - \mu_i]^T \Sigma_i^{-1} [f(x) - \mu_i] \qquad (5)$$

where $\Sigma_i^{-1}$ is the inverse of the covariance matrix. We remark that such a
potential $W$ is not differentiable at the junction points of the multidimensional
parabolas. In the experiments conducted so far, we have not noticed
instabilities that could be due to these points of non-differentiability.
However, it would be interesting to construct a smoother potential $W$ by building
junctions.

Fig. 1. Vector-valued potential $W$ projected on the plane $(I^1, I^2)$ in the case of
2 bands and 3 classes.
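The potential of (5) is the smallest squared Mahalanobis distance from a pixel vector to the class means. A minimal Python sketch follows; the function name and the toy class parameters are hypothetical, and `numpy.linalg.solve` is used instead of forming the inverse covariance explicitly.

```python
import numpy as np

def potential_w(f, means, covariances):
    """W(f) = min_i (f - mu_i)^T Sigma_i^{-1} (f - mu_i) for one pixel vector f."""
    f = np.asarray(f, dtype=float)
    values = []
    for mu, sigma in zip(means, covariances):
        diff = f - np.asarray(mu, dtype=float)
        # Solve Sigma x = diff rather than inverting Sigma explicitly.
        values.append(float(diff @ np.linalg.solve(np.asarray(sigma, dtype=float), diff)))
    return min(values)

# Toy setting (hypothetical): 2 bands and 3 classes, as in the Figure 1 layout.
means = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]
covariances = [np.eye(2), 4.0 * np.eye(2), np.eye(2)]

at_mean = potential_w([10.0, 0.0], means, covariances)
near_first = potential_w([1.0, 2.0], means, covariances)
```

The `min` over classes is exactly what produces the non-differentiable junctions mentioned above, where two parabolas take the same value.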

Considering a sequence of criteria $J_\varepsilon$ as $\varepsilon \to 0$ is inspired by work
conducted in the Van der Waals-Cahn-Hilliard theory framework for phase
transitions in fluid mechanics [2,29].

The minimization problem (4) relies upon Γ-convergence arguments [2,29].
If $\varphi(t) = t^2$, then we can show from [2] that the sequence of functionals (3)
Γ-converges to

f Mf) = \fj-i - i\lpdx + 

< a f £ BV{Q), and W{f{x)) = 0 a.e. (6) 

['^o(/) = +0O, otherwise 

where F^j is the interface between regions Qi and Qj and BV{n) is the space of 
functions of bounded variation [13]. The distance d is defined by: 

= infg^^J ^/W{g{s))\g' {s)\ds; 5 G C^([0; 1], K^) 

and g>0, g{0) = g{l) = fij | (7) 

From $\Gamma$-convergence theory we know that the sequence of minimizers $f_\varepsilon$
of $J_\varepsilon(f)$ converges (up to a subsequence) to a minimizer of $J_0$. Moreover, $f_0$ has
the form

\[
f_0(x) = \sum_{i=1}^{K} \mu_i\, \chi_{\Omega_i}(x), \tag{8}
\]

where $\chi_{\Omega_i}$ is the characteristic function of $\Omega_i$. So $f_0$ defines a partition of $\Omega$
according to the predefined classes, with minimal interfaces with respect to the
weighted length (7). Notice that we have not exactly reached the desired minimum
of (1), since the regularization term in $J_0(f)$ is a weighted distance.

From the numerical point of view, as $\varepsilon$ decreases, the functional turns from
a restoration process (the third term in (3) is negligible) into a classification
process.

We have used an edge-preserving regularizing function $\varphi$ because
it numerically gives better results, by preserving high gradients which represent
edges [10]. The theoretical justification of the $\Gamma$-limit using this function rather
than the quadratic one is under consideration.



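For the numerical minimization of $J_\varepsilon$ for a fixed $\varepsilon$, we use the half-quadratic
decomposition of the function $\varphi$ as in [10,15], in order to avoid nonlinearities in the
Euler-Lagrange equations associated with the minimization of (3). By introducing
an auxiliary variable $b : \Omega \to \mathbb{R}$, the minimization of (3) is replaced by the
minimization of $J_\varepsilon^*(f,b)$ with respect to $(f,b)$, with:

\[
J_\varepsilon^*(f,b) = \int_\Omega \|f(x)-I(x)\|^2\,dx
+ \varepsilon\lambda^2 \int_\Omega \Big[ b(x)\,|\nabla f(x)|^2_{\mathbb{R}^p} + \psi(b(x)) \Big]\,dx
+ \frac{\eta^2}{\varepsilon} \int_\Omega W(f(x))\,dx, \tag{9}
\]

where $\psi$ is a convex function with respect to $b$, defined from $\varphi$. Then, for each
fixed value of $\varepsilon$, we solve alternately the Euler-Lagrange equations

\[
\begin{aligned}
b &= \frac{\varphi'\big(|\nabla f|_{\mathbb{R}^p}\big)}{2\,|\nabla f|_{\mathbb{R}^p}} \in \mathbb{R} && \text{with } f \text{ fixed,}\\
f - \lambda^2 \varepsilon\, \mathrm{div}(b \nabla f) + \frac{\eta^2}{2\varepsilon} W'(f) &= I && \text{with } b \text{ fixed.}
\end{aligned}
\tag{10}
\]

For each $\varepsilon$ a solution is computed, which is used to initialize the system (10) for
the next value of $\varepsilon$.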
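To illustrate how the auxiliary variable $b$ behaves, the sketch below evaluates the usual half-quadratic weight $b = \varphi'(t)/(2t)$ for one edge-preserving function. This is only an illustration under assumptions: the choice $\varphi(t) = t^2/(1+t^2)$ and the function names are hypothetical, and the function actually used in the experiments is not specified here.

```python
def phi(t):
    """Illustrative edge-preserving function (an assumption, not the paper's)."""
    return t * t / (1.0 + t * t)


def phi_prime(t):
    """Derivative of phi."""
    return 2.0 * t / (1.0 + t * t) ** 2


def half_quadratic_weight(t):
    """Weight b = phi'(t) / (2 t); its limit as t -> 0 is 1 for this phi."""
    if t == 0.0:
        return 1.0
    return phi_prime(t) / (2.0 * t)


if __name__ == "__main__":
    # Small gradients get a weight near 1 (smoothing is applied),
    # large gradients get a weight near 0 (edges are preserved).
    for t in (0.0, 0.5, 1.0, 3.0, 10.0):
        print(t, half_quadratic_weight(t))
```

With $b$ fixed, the second equation of the alternate scheme is linear in $\nabla f$, which is the point of the decomposition.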

Remark on the calculus of $|\nabla f|_{\mathbb{R}^p}$

In order to use inter-band information for the restoration process, we can replace
in equations (10) the divergence term $\mathrm{div}(b\nabla f)$ by $\mathrm{div}(\sqrt{\lambda_+ - \lambda_-}\,\theta_+)$, where
$\lambda_+$ and $\lambda_-$ are respectively the highest and the lowest eigenvalues of the first
fundamental form matrix of $f$, and $\theta_+$ is the eigenvector associated with $\lambda_+$, that is,
the direction of the highest slope. This gives an anisotropic smoothing of $f$ along
the direction of highest variation. Such smoothing was first introduced in
[33], and has also been used in [6,28,32].
4 Classification by a Dynamical Model

In the second approach, the unknown variables are the sets $\Omega_i$, and the
classification problem is stated as a partitioning problem according to the
predefined classes. Let $\Omega_i$ be the region defined as

\[
\Omega_i = \{x \in \Omega \,/\, x \text{ belongs to the } i^{\text{th}} \text{ class}\}. \tag{11}
\]

A partitioning of $\Omega$ consists of finding a set $\{\Omega_i\}_{i=1\dots K}$ such that

\[
\Omega = \bigcup_{i=1}^{K} (\Omega_i \cup \Gamma_i) \quad \text{and} \quad \Omega_i \cap \Omega_j = \emptyset,\ i \neq j, \tag{12}
\]

where $\Gamma_i = \partial\Omega_i \cap \Omega$. The interface between $\Omega_i$ and $\Omega_j$ is denoted by $\Gamma_{ij} = \Gamma_{ji} =
\Gamma_i \cap \Gamma_j$, $\forall i \neq j$. We have $\Gamma_i = \bigcup_{j \neq i} \Gamma_{ij}$. $|\Gamma_i| = \mathcal{H}^1(\Gamma_i)$ is the one-dimensional
Hausdorff measure of the set $\Gamma_i$, satisfying

\[
|\Gamma_i| = \sum_{j \neq i} |\Gamma_{ij}| \quad \text{with} \quad |\emptyset| = 0. \tag{13}
\]

We are looking for sets $\Omega_i$ such that $\{\Omega_i\}_{i=1\dots K}$ is a partition of $\Omega$, and such
that the $\Omega_i$'s define a classification of the observed data $I$, taking into account
the Gaussian distribution property of the classes (data term). We also impose
that the partition is regular, in the sense that the sum of the lengths of the interfaces
$\Gamma_{ij}$ is minimum. The functional to minimize is then the one proposed in (2).

In order to change the minimization problem with respect to the sets $\Omega_i$ into a
minimization problem with respect to functions, we have proposed to use the level set method
(see [26] for the monospectral data case). Let $\Phi_i : \Omega \to \mathbb{R}$ be a Lipschitz function
associated with the region $\Omega_i$ such that $\Phi_i(x) > 0$ if $x \in \Omega_i$, $\Phi_i(x) = 0$ if $x \in \Gamma_i$,
and $\Phi_i(x) < 0$ otherwise.

Thus, the region $\Omega_i$ is entirely described by the function $\Phi_i$. The resulting
model proposed in [26] is inspired by the work of Zhao et al. on multiphase
evolution [34], and takes place in the general framework of active contours
[8,9,16,19] for region segmentation [24,35]. For multispectral data, we propose
to minimize

\[
\begin{aligned}
F_\alpha(\Phi_1,\dots,\Phi_K) ={}&
\underbrace{\sum_{i=1}^{K} e_i \int_\Omega H_\alpha(\Phi_i)\,[I-\mu_i]^T \Sigma_i^{-1} [I-\mu_i]\,dx}_{\text{multispectral data term}}\\
&+ \underbrace{\sum_{i=1}^{K} \gamma_i \int_\Omega \delta_\alpha(\Phi_i)\,|\nabla\Phi_i|\,dx}_{\text{minimization of contour length}}
+ \underbrace{\frac{\lambda}{2} \int_\Omega \Big(\sum_{i=1}^{K} H_\alpha(\Phi_i) - 1\Big)^2 dx}_{\text{partition constraint}}
\end{aligned}
\tag{14}
\]

where $H_\alpha$ and $\delta_\alpha$ are smooth approximations of, respectively, the Heaviside
function $H$ and the Dirac distribution $\delta$. The parameters $e_i$, $\gamma_i$ and $\lambda$ are positive
fixed real numbers. The partition constraint is introduced through the third
term in (14). As $\alpha \to 0^+$, this term penalizes the formation of vacuum (pixels
with no label) and overlapping regions (pixels with more than one label).
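The role of the partition term can be checked at a single pixel. The sketch below is only illustrative: the particular smooth approximations $H_\alpha$ and $\delta_\alpha$ (a sine/cosine regularization commonly used with level sets) are an assumption, since their exact form is not given here, and the function names are hypothetical.

```python
import math


def heaviside_alpha(s, alpha):
    """Smooth Heaviside approximation (assumed form), exact outside [-alpha, alpha]."""
    if s > alpha:
        return 1.0
    if s < -alpha:
        return 0.0
    return 0.5 * (1.0 + s / alpha + math.sin(math.pi * s / alpha) / math.pi)


def dirac_alpha(s, alpha):
    """Derivative of heaviside_alpha: a smooth Dirac approximation."""
    if abs(s) > alpha:
        return 0.0
    return (1.0 + math.cos(math.pi * s / alpha)) / (2.0 * alpha)


def partition_penalty(phis, alpha):
    """(sum_i H_alpha(phi_i) - 1)^2 at one pixel, given the K level set values."""
    total = sum(heaviside_alpha(p, alpha) for p in phis)
    return (total - 1.0) ** 2


if __name__ == "__main__":
    alpha = 0.5
    print(partition_penalty([2.0, -2.0, -3.0], alpha))   # exactly one label
    print(partition_penalty([-2.0, -2.0, -2.0], alpha))  # vacuum: no label
    print(partition_penalty([2.0, 2.0, -2.0], alpha))    # overlap: two labels
```

The penalty is zero only when exactly one function is positive at the pixel, which is the partition condition.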

By using the coarea formula, we show that [26]

\[
\lim_{\alpha \to 0} \int_\Omega \delta_\alpha(\Phi_i(x))\,|\nabla\Phi_i(x)|\,dx = |\Gamma_i|.
\]

The limiting functional, when $\alpha \to 0$, is:

\[
F_0(\Phi_1,\dots,\Phi_K) =
\underbrace{\sum_{i=1}^{K} e_i \int_{\Omega_i} [I-\mu_i]^T \Sigma_i^{-1} [I-\mu_i]\,dx}_{\text{data term}}
+ \underbrace{\sum_{i=1}^{K} \gamma_i\,|\Gamma_i|}_{\text{minimization of contour length}}
+ \underbrace{\frac{\lambda}{2} \int_\Omega \Big(\sum_{i=1}^{K} H(\Phi_i) - 1\Big)^2 dx}_{\text{partition constraint}}
\tag{15}
\]

As $\alpha \to 0^+$, the solution set minimizing $F_\alpha(\Phi_1,\dots,\Phi_K)$, if it exists,
defines a classification composed of homogeneous classes (the so-called $\Omega_i$ phases)
separated by regularized interfaces, as defined in (2).

Based on numerical results, we have introduced a weighted length term in
order to get improved results. Let $g : \mathbb{R} \to \mathbb{R}^+$ be a function of $|\nabla I|$ chosen
so that it vanishes around high gradients of $I$. This is a standard stopping function
used in active contours. This stopping function is introduced in the length term
in order to force the contours of the classification to stop around the
high gradients of the data $I$. The functional we minimize is then

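\[
\begin{aligned}
F_\alpha(\Phi_1,\dots,\Phi_K) ={}&
\underbrace{\sum_{i=1}^{K} e_i \int_\Omega H_\alpha(\Phi_i)\,[I-\mu_i]^T \Sigma_i^{-1} [I-\mu_i]\,dx}_{\text{multispectral data term}}\\
&+ \underbrace{\sum_{i=1}^{K} \gamma_i \int_\Omega g(|\nabla I|)\,\delta_\alpha(\Phi_i)\,|\nabla\Phi_i|\,dx}_{\text{minimization of contour weighted length}}
+ \underbrace{\frac{\lambda}{2} \int_\Omega \Big(\sum_{i=1}^{K} H_\alpha(\Phi_i) - 1\Big)^2 dx}_{\text{partition constraint}}
\end{aligned}
\tag{16}
\]

To minimize the functional (16), we derive the $K$ Euler-Lagrange equations
with respect to the $K$ functions $\Phi_i$. These equations are embedded in a dynamical
scheme, where the variable $t$ is the time parameter and $dt$ the time step:

\[
\Phi_i^{t+1} = \Phi_i^{t} - dt\,\delta_\alpha(\Phi_i^{t}) \Big\{ e_i\,[I-\mu_i]^T \Sigma_i^{-1} [I-\mu_i]
- \gamma_i\, g(|\nabla I|)\,\mathrm{div}\Big(\frac{\nabla\Phi_i^{t}}{|\nabla\Phi_i^{t}|}\Big)
- \gamma_i\, \frac{\nabla g \cdot \nabla\Phi_i^{t}}{|\nabla\Phi_i^{t}|}
+ \lambda \Big( \sum_{j=1}^{K} H_\alpha(\Phi_j^{t}) - 1 \Big) \Big\}
\tag{17}
\]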
Fig. 2. SPOT multispectral data (XS3 band) of Lannion Bay (August 1997) after
histogram equalization.

The $K$ PDEs are coupled through the partition term $\sum_{i=1}^{K} H_\alpha(\Phi_i) - 1$. The
evolution of each $\Phi_i$ is guided by three different forces constraining the solution
to be a regular partition of $\Omega$ according to the classification of the data. The
functions $\Phi_i$ are defined as signed distance functions to the zero level set. In
order to preserve the constraint $|\nabla\Phi_i| = 1$ (cf. [3,14]) during the algorithm, the
functions $\Phi_i$ are regularly updated by using the PDE proposed by Sussman et
al. [30]: $\partial\Phi_i/\partial t = \mathrm{sign}(\Phi_i)\,(1 - |\nabla\Phi_i|)$.

The parameters are tuned by trial and error according to the dimension and
morphology of the regions (for the weights of the length term), and according
to the noise (for the weights of the data term). The functions $\Phi_i$ are automatically
initialized by dividing the image support $\Omega$ into $N_w$ windows $W_n$, $n = 1..N_w$,
of fixed size. In each window, we compute the mean $m_n$ and the standard
deviation $\sigma_n$, and we search for the nearest class, that is, the index $k$ that
minimizes over the classes $j$ the Bhattacharyya distance $d_B$ [5] between the window
statistics and those of class $j$; this distance measures how far apart two Gaussian
distributions are. We do not take into account the correlation between bands for this initialization.
Let us remark that the smallest object which can be detected is linked to the
size of the initial windows $W_n$.
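The window-based initialization can be sketched as follows. This is a minimal illustration, assuming that ignoring the correlation between bands amounts to summing the one-dimensional Bhattacharyya distances over the bands; the function names and the numerical values are hypothetical.

```python
import math


def bhattacharyya_1d(m1, s1, m2, s2):
    """Bhattacharyya distance between N(m1, s1^2) and N(m2, s2^2)."""
    v1, v2 = s1 * s1, s2 * s2
    return (0.25 * (m1 - m2) ** 2 / (v1 + v2)
            + 0.5 * math.log((v1 + v2) / (2.0 * s1 * s2)))


def nearest_class(win_means, win_stds, class_means, class_stds):
    """Index of the class closest to a window, bands treated as independent."""
    def distance(k):
        return sum(
            bhattacharyya_1d(m, s, class_means[k][band], class_stds[k][band])
            for band, (m, s) in enumerate(zip(win_means, win_stds)))
    return min(range(len(class_means)), key=distance)


if __name__ == "__main__":
    # Hypothetical 2-band statistics for two classes and one window.
    class_means = [(10.0, 20.0), (50.0, 40.0)]
    class_stds = [(3.0, 4.0), (5.0, 5.0)]
    print(nearest_class((12.0, 21.0), (3.0, 4.0), class_means, class_stds))
```

Each window would then initialize the level set function of the class returned for it.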

5 Results on SPOT Data 

We present some results on SPOT data of Lannion Bay in France (see Figure 2)
provided by SPOTIMAGE. These data have been used to study and measure
Table 1. For each class: number of misclassified pixels and, in parentheses, rate of
success according to the algorithm. The last row is the number of pixels used for the
ground truth (GT).

Classes: 1 sea water; 2 sand, uncovered grounds; 3 urban areas; 4 woods, heath;
5 grasslands; 6 pasturelands; 7 vegetables; 8 corn.

Method          1          2          3          4        5         6          7        8         total
ML              16 (96%)   77 (76%)   119 (63%)  1 (93%)  54 (49%)  182 (4%)   5 (0%)   58 (56%)  512 (65%)
ICM-NO          16 (96%)   73 (77%)   103 (68%)  1 (93%)  50 (53%)  189 (1%)   5 (0%)   61 (54%)  498 (66%)
H-MAP           16 (96%)   76 (76%)   63 (81%)   1 (93%)  41 (61%)  177 (7%)   5 (0%)   54 (59%)  433 (71%)
M1              0 (100%)   79 (75%)   93 (71%)   2 (86%)  52 (51%)  146 (23%)  4 (20%)  44 (67%)  421 (71%)
M2              0 (100%)   105 (67%)  101 (69%)  9 (36%)  41 (61%)  160 (16%)  5 (0%)   19 (86%)  440 (70%)
pix. nb for GT  376        320        323        14       106       190        5        132       1466

how intensive farming changes land use in this area. A complete study was
made during the PhD thesis of Annabelle Chardin [11], conducted at IRISA in
the VISTA research group^1. She developed a hierarchical Markov Random Field
model for classification (called H-MAP) and compared her results with those
obtained by a Maximum Likelihood (ML) estimator (classification without any
restoration) and with those obtained by the restriction of her algorithm to the
highest level (called ICM-NO). (M1) (resp. (M2)) stands for the first (resp. second)
variational model presented in this paper. We also have ground truth given
by geographers from COSTEL at Rennes University. The number of classes was
fixed by these experts. They also chose small rectangular areas for
the estimation of the parameters of the classes, and other small rectangular areas
for validation with ground truth. The original image is 1480x1024 pixels, and
we present results on a portion of 400x400 pixels, as shown in Figure 2. The classified
images are presented in Figure 3, with the corresponding legend given
in Table 1. Numerical results according to the ground truth available on small areas
of the image are also listed in Table 1.
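The success rates of Table 1 follow directly from the counts: for each class, the rate is the share of ground-truth pixels that are not misclassified. The short sketch below recomputes them for the ML row; the function name is hypothetical and the rounding to the nearest percent is an assumption that happens to match the table.

```python
def success_rates(misclassified, ground_truth):
    """Percentage of correctly classified ground-truth pixels, per class."""
    return [round(100 * (g - m) / g) for m, g in zip(misclassified, ground_truth)]


# Counts copied from Table 1 (classes 1 to 8).
GT_PIXELS = [376, 320, 323, 14, 106, 190, 5, 132]
ML_ERRORS = [16, 77, 119, 1, 54, 182, 5, 58]

if __name__ == "__main__":
    print(success_rates(ML_ERRORS, GT_PIXELS))
    # Overall rate over all ground-truth pixels.
    print(success_rates([sum(ML_ERRORS)], [sum(GT_PIXELS)]))
```

With only 14 and 5 ground-truth pixels, classes 4 and 7 move by several points for a single misclassified pixel, so their rates carry little information.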

6 Comments and Conclusion 

By inspecting the images in Figure 3, we can see that the ML estimator (classification
without restoration) results in a noisy image with many small regions,
which gives nonhomogeneous areas. In contrast, the visually smoothest result
is the one given by applying (M2). This is mainly due to the initialization

^1 We thank the VISTA research group for kindly making these data available
for comparison with our methods.
Fig. 3. Results of classification by ML, H-MAP, M1 and M2 (color image).



process, where the regions are initialized on windows of 5 x 5 pixels, which limits the
spatial resolution of the model. It is not due to the regularization term (length
minimization), because some points with high curvature remain. In the (H-MAP)
image, we note that contours are mainly horizontal and vertical because of the
first order Markov random field (Potts model) used in this approach [11].

Looking at the numerical results in Table 1, the first remark is that classes 4
and 7 are not significant because the number of pixels in the ground truth is too
small (14 and 5 respectively). The second remark is that classes 5 and 6 are
mixed, which probably explains the poor results obtained for
these classes. For class 3 (urban areas), we should have taken into account
texture parameters rather than grey level values for the classification. It seems
that the hierarchical model (H-MAP) is more robust with respect to this poor
modeling. Model (M2) performs worse for class 2, mainly because this class is
composed of small regions which are not caught by the initialization procedure
(5x5 pixel windows). This model however gives better results for class 8.

It is difficult to give a general conclusion with respect to the use of these different
approaches for classification. Firstly, the number of pixels for the ground truth
is too small to analyse the results. Notice that the areas selected for estimating the
class parameters (different from the previous ones) are sufficient. Secondly, we
would also need the evaluation of experts on the final classified images. Thirdly,
we should also take into account the computational time in the comparison (for
more details see [25]).
Abstract. After [10,15,12,2,4], minimum cut/maximum flow algorithms
on graphs emerged as an increasingly useful tool for exact or
approximate energy minimization in low-level vision. The combinatorial
optimization literature provides many min-cut/max-flow algorithms with
different polynomial time complexity. Their practical efficiency, however,
has to date been studied mainly outside the scope of computer vision.
The goal of this paper is to provide an experimental comparison of the
efficiency of min-cut/max-flow algorithms for energy minimization in vision.
We compare the running times of several standard algorithms, as
well as a new algorithm that we have recently developed. The algorithms
we study include both Goldberg-style "push-relabel" methods and algorithms
based on Ford-Fulkerson style augmenting paths. We benchmark
these algorithms on a number of typical graphs in the contexts of image
restoration, stereo, and interactive segmentation. In many cases our
new algorithm works several times faster than any of the other methods,
making near real-time performance possible.



1 Introduction 

Greig et al. [10] were the first to discover that powerful min-cut/max-flow algorithms
from combinatorial optimization can be used to minimize certain important energy
functions in vision. The energies addressed by Greig et al. and by most later
graph based methods (e.g. [15,12,2,11,4,1,18,13,16,17,3,14]) can be represented
as a posterior energy in the MAP-MRF^1 framework:

\[
E(L) = \sum_{p \in \mathcal{P}} D_p(L_p) + \sum_{(p,q) \in \mathcal{N}} V_{p,q}(L_p, L_q), \tag{1}
\]

where $L = \{L_p \mid p \in \mathcal{P}\}$ is a labeling of image $\mathcal{P}$, $D_p(\cdot)$ is a data penalty function,
$V_{p,q}$ is an interaction potential, and $\mathcal{N}$ is the set of all pairs of neighboring pixels.
The papers above show that, to date, graph based energy minimization methods
provide arguably the most accurate solutions for the specified applications.
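As an illustration of energy (1), the sketch below evaluates it for a labeling of a small grid with a Potts interaction (a constant penalty for each pair of neighbors with different labels). The 4-neighborhood, the absolute-difference data penalty, the function names and the numbers are all assumptions made for the example.

```python
def potts_energy(labels, data_cost, lam):
    """E(L) = sum_p D_p(L_p) + lam * (number of neighbor pairs with different labels).

    Each unordered pair of 4-neighbors is counted once.
    """
    rows, cols = len(labels), len(labels[0])
    energy = 0
    for r in range(rows):
        for c in range(cols):
            energy += data_cost(r, c, labels[r][c])
            if c + 1 < cols and labels[r][c] != labels[r][c + 1]:
                energy += lam          # horizontal discontinuity
            if r + 1 < rows and labels[r][c] != labels[r + 1][c]:
                energy += lam          # vertical discontinuity
    return energy


# Hypothetical 2 x 3 observed image and two label intensities.
OBSERVED = [[1, 0, 8], [0, 9, 9]]
LEVELS = (0, 9)


def data_cost(r, c, label):
    """D_p(L_p): absolute difference between the pixel and the label intensity."""
    return abs(OBSERVED[r][c] - LEVELS[label])


if __name__ == "__main__":
    print(potts_energy([[0, 0, 1], [0, 1, 1]], data_cost, lam=2))
    print(potts_energy([[0, 0, 0], [0, 0, 0]], data_cost, lam=2))
```

The first labeling pays a small data cost plus three discontinuities, while the constant labeling pays no discontinuity but a large data cost.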

^1 MAP-MRF stands for Maximum A Posteriori estimation of a Markov Random Field.



M.A.T. Figueiredo, J. Zerubia, A.K. Jain (Eds.): EMMCVPR 2001, LNCS 2134, pp. 359-374, 2001.
(c) Springer-Verlag Berlin Heidelberg 2001
Greig et al. constructed a two terminal graph such that the minimum cost cut
of the graph gives a globally optimal binary labeling L in the case of the Potts model
of interaction in (1). Previously, exact minimization of energies like (1) was not
possible and such energies were approached mainly with iterative algorithms
like simulated annealing. In fact, Greig et al. used their result to show that in
practice simulated annealing reaches solutions very far from the global minimum
even in very simple image restoration examples.

Unfortunately, the result of Greig et al. remained unnoticed for almost 10
years, mainly because the binary labeling limitation looked too restrictive. In the
late 90's new computer vision techniques appeared that used min-cut/max-flow
algorithms on graphs. [15] was the first to use these algorithms to compute
multi-camera stereo. Later, [12,2] showed that, with the right edge weights on a graph
similar to that of [15], one can minimize the energy in (1) for linear interaction penalties.
In that case the exact minimum can be computed even when there are more than two labels.
The results in [2,4] showed that iteratively running min-cut/max-flow algorithms
on appropriate graphs can be used to find provably good approximate solutions
for the even more general multi-label case where interaction penalties are metrics.

A growing number of publications in vision use graph based energy minimization
techniques for applications like image segmentation [12,18,13,3], restoration
[10], stereo [15,2,11,14], shape reconstruction [16], object recognition [1],
augmented reality [17], and others. The graphs corresponding to these applications
are usually huge 2D or 3D grids, and min-cut/max-flow algorithm efficiency is
an issue that cannot be ignored.

The goal of this paper is to compare experimentally the speed of several
min-cut/max-flow algorithms on graphs typical for applications in vision. In Section 2
we provide basic facts about graphs, min-cut and max-flow problems, and some
standard combinatorial optimization algorithms for them. Section 3 introduces a
new min-cut/max-flow algorithm that we developed while working with graphs
in vision. In Section 4 we test our new algorithm and three standard
min-cut/max-flow algorithms: the H_PRF and Q_PRF versions of the Goldberg-style
"push-relabel" method [9,5], and the Dinic algorithm [7]. We selected several examples
in image restoration, stereo, and segmentation where different forms of energy
(1) are minimized via graph structures originally described in [10,12,2,4,14,3].
Such (or very similar) graphs are used in all computer vision papers known to
us that use graph cut algorithms. In many interesting cases our new algorithm
was significantly faster than the standard min-cut/max-flow techniques from
combinatorial optimization. More detailed conclusions are presented in Section 5.

2 Background on Graphs 

In this section we review some basic facts about graphs in the context of energy
minimization methods in vision. A graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ consists of a set of nodes $\mathcal{V}$
and a set of directed edges $\mathcal{E}$ that connect them. Usually the nodes correspond
to pixels, voxels, or other features. A graph normally contains some additional
special nodes that are called terminals. In the context of vision, terminals correspond
to the set of labels that can be assigned to pixels. We will concentrate
on the case of graphs with two terminals. Then the terminals are usually called
the source, s, and the sink, t. In Figure 1(a) we show a simple example of a two
terminal graph (due to Greig et al. [10]) that can be used to minimize the Potts
case of energy (1) on a 3 x 3 image with two labels. There is some variation in
the structure of graphs used in other energy minimization methods in vision.
However, most of them are based on regular 2D or 3D grid graphs like the one
in Figure 1(a). This is a simple consequence of the fact that normal images (or
volume data) in vision have grid-like structures.

Fig. 1. Example of a graph. Edge costs are reflected by their thickness. This graph
construction was first used in Greig et al. [10].

All edges in the graph are assigned some weight or cost. The cost of a directed
edge (p, q) may differ from the cost of the reverse edge (q, p). Normally, there
are two types of edges in the graph: n-links and t-links. N-links connect pairs
of neighboring pixels or voxels. Thus, they represent a neighborhood system in
the image. The cost of an n-link corresponds to a penalty for discontinuity between
the pixels. These costs are usually derived from the pixel interaction term $V_{p,q}$
in energy (1). T-links connect pixels with terminals (labels). The cost of a t-link
connecting a pixel and a terminal corresponds to a penalty for assigning the
corresponding label to the pixel. This cost is normally derived from the data
term $D_p$ in the energy function (1).



2.1 Min-cut and Max-flow Problems 

An s/t cut (or just a cut) C on a graph with two terminals is a partitioning of
the nodes in the graph into two disjoint subsets S and T such that the source
s is in S and the sink t is in T. Figure 1(b) shows one example of a cut. In
combinatorial optimization the cost of a cut C = {S, T} is defined as the sum of
the costs of the "boundary" edges (p, q) where p ∈ S and q ∈ T. The minimum cut
problem on a graph is to find a cut that has the minimum cost among all cuts.

One of the fundamental results in combinatorial optimization is that the
minimum s/t cut problem can be solved by finding a maximum flow from the
source s to the sink t. Loosely speaking, maximum flow is the maximum "amount
of water" that can be sent from the source to the sink by interpreting graph edges
as directed "pipes" with capacities equal to edge weights. The theorem of Ford
and Fulkerson [8] states that a maximum flow from s to t saturates a set of edges
in the graph dividing the nodes into two disjoint parts {S, T} corresponding to
a minimum cut. Thus, min-cut and max-flow problems are equivalent. In fact,
the maximum flow value is equal to the cost of the minimum cut.

We can also intuitively show how min-cut (or max-flow) on a graph may
help with energy minimization over image labelings. Consider the example in
Figure 1. The graph corresponds to a 3 x 3 image. Any s/t cut partitions the
nodes into disjoint groups each containing exactly one terminal. Therefore, any
cut corresponds to some assignment of pixels (nodes) to labels (terminals). If
edge weights are appropriately set based on the parameters of an energy, a minimum
cost cut will correspond to a labeling with the minimum value of this energy^2.
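The correspondence between cuts and labelings can be checked by brute force on a toy example. The sketch below uses a 3-pixel chain with two labels and a Potts interaction. The edge-weight convention (a pixel on the source side takes label 0, a pixel on the sink side takes label 1) is one possible choice made for this illustration, and all names and numbers are hypothetical.

```python
from itertools import product


def build_edges(d0, d1, lam):
    """Directed edges (u, v, cost) of a two-terminal graph for a pixel chain."""
    edges = []
    for p in range(len(d0)):
        edges.append(("s", p, d1[p]))   # t-link cut when p is on the sink side (label 1)
        edges.append((p, "t", d0[p]))   # t-link cut when p is on the source side (label 0)
    for p in range(len(d0) - 1):
        edges.append((p, p + 1, lam))   # n-links in both directions
        edges.append((p + 1, p, lam))
    return edges


def cut_cost(source_side, edges):
    """Sum of the costs of edges going from the source side to the sink side."""
    return sum(c for u, v, c in edges if u in source_side and v not in source_side)


def energy(labels, d0, d1, lam):
    """Potts energy: data penalties plus lam per pair of differing neighbors."""
    data = sum(d1[p] if lab else d0[p] for p, lab in enumerate(labels))
    jumps = sum(labels[p] != labels[p + 1] for p in range(len(labels) - 1))
    return data + lam * jumps


D0, D1, LAM = [1, 4, 6], [5, 2, 1], 2     # hypothetical penalties
EDGES = build_edges(D0, D1, LAM)

if __name__ == "__main__":
    for labels in product((0, 1), repeat=3):
        side = {"s"} | {p for p, lab in enumerate(labels) if lab == 0}
        print(labels, cut_cost(side, EDGES), energy(labels, D0, D1, LAM))
```

For every labeling the cut cost equals the energy, so the cheapest cut yields the labeling of minimum energy.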

2.2 Standard Algorithms in Combinatorial Optimization 

An important fact in combinatorial optimization is that there are polynomial
algorithms for min-cut/max-flow problems on graphs with two terminals. These
algorithms can be divided into two main groups: Goldberg-style "push-relabel"
methods and algorithms based on Ford-Fulkerson style augmenting paths.

Standard augmenting path based algorithms, such as the Dinic algorithm, work
by pushing flow along non-saturated paths from the source to the sink until the
maximum flow in the graph $\mathcal{G}$ is reached. A typical augmenting path algorithm
stores information about the distribution of the current s → t flow f among the
edges of $\mathcal{G}$ using a residual graph $\mathcal{G}_f$. The topology of $\mathcal{G}_f$ is identical to that of $\mathcal{G}$, but
the capacity of an edge in $\mathcal{G}_f$ reflects the residual capacity of the same edge in $\mathcal{G}$
given the amount of flow already in the edge. At initialization there is no
flow from the source to the sink (f = 0) and edge capacities in the residual graph
$\mathcal{G}_0$ are equal to the original capacities in $\mathcal{G}$. At each new iteration the algorithm
finds the shortest s → t path along non-saturated edges of the residual graph.
If a path is found then the algorithm augments it by pushing the maximum
possible flow df that saturates at least one of the edges in the path. The residual
capacities of the edges in the path are reduced by df while the residual capacities
of the reverse edges are increased by df. Each augmentation increases the total
flow from the source to the sink: f = f + df. The maximum flow is reached when
any s → t path crosses at least one saturated edge in the residual graph $\mathcal{G}_f$.

The Dinic algorithm uses breadth-first search to find the shortest paths from s
to t on the residual graph $\mathcal{G}_f$. After all shortest paths of a fixed length k are
saturated, the algorithm starts the breadth-first search for s → t paths of length
k + 1 from scratch. Note that the use of shortest paths is an important factor
that improves running time complexities for algorithms based on augmenting
paths. The worst case running time complexity for the Dinic algorithm is $O(mn^2)$,
where n is the number of nodes and m is the number of edges in the graph.
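The augmenting path loop described above can be sketched in a few lines. Note the assumption: this is a plain shortest-augmenting-path variant that restarts a breadth-first search after every augmentation, which is simpler than the Dinic algorithm (it does not saturate all paths of a given length at once). The function names and the small test graph are hypothetical.

```python
from collections import deque


def max_flow_shortest_paths(n, edges, s, t):
    """Return (max flow value, nodes on the source side of a minimum cut).

    Nodes are 0..n-1 and edges are (u, v, capacity) triples.
    """
    cap = [dict() for _ in range(n)]            # residual capacities
    for u, v, c in edges:
        cap[u][v] = cap[u].get(v, 0) + c
        cap[v].setdefault(u, 0)                 # reverse residual edge
    flow = 0
    while True:
        # Breadth-first search along non-saturated edges of the residual graph.
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            # No s -> t path left: the reached nodes form the source side.
            return flow, set(parent)
        path, v = [], t
        while v != s:
            path.append((parent[v], v))
            v = parent[v]
        df = min(cap[u][v] for u, v in path)    # bottleneck capacity
        for u, v in path:
            cap[u][v] -= df                     # reduce residual capacity
            cap[v][u] += df                     # increase reverse capacity
        flow += df


def cut_cost(edges, source_side):
    """Cost of the cut: edges leaving the source side, original capacities."""
    return sum(c for u, v, c in edges if u in source_side and v not in source_side)


EXAMPLE_EDGES = [(0, 1, 16), (0, 2, 13), (1, 2, 10), (2, 1, 4), (1, 3, 12),
                 (3, 2, 9), (2, 4, 14), (4, 3, 7), (3, 5, 20), (4, 5, 4)]

if __name__ == "__main__":
    value, side = max_flow_shortest_paths(6, EXAMPLE_EDGES, 0, 5)
    print(value, sorted(side), cut_cost(EXAMPLE_EDGES, side))
```

On termination, the nodes still reachable from the source define a cut whose cost equals the flow value, in agreement with the theorem of Ford and Fulkerson recalled in Section 2.1.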

^2 Different graph based energy minimization methods may use different graph
constructions, as well as different rules for converting graph cuts into image labelings.
Details for each method are described in the original publications.
Push-relabel algorithms use quite a different approach. They do not maintain
a valid flow during the operation; each node may have a positive "flow excess",
and the algorithm tries to push it to neighboring nodes. Push-relabel techniques
are harder to describe in just a few sentences and we would rather refer the
reader to our favorite textbook on basic graph theory and algorithms [6].

For our experimental tests on graph-based energy minimization methods in
vision we selected the following standard algorithms.

DINIC: Algorithm of Dinic [7].
H_PRF: Push-Relabel algorithm [9] with the highest level selection rule.
Q_PRF: Push-Relabel algorithm [9] with the queue based selection rule.

Many previous experimental tests, including the results in [5], show that the
last two algorithms work consistently better than a large number of other
min-cut/max-flow algorithms of combinatorial optimization. The theoretical worst
case complexities for these "push-relabel" algorithms are $O(n^3)$ for Q_PRF and
$O(n^2\sqrt{m})$ for H_PRF.

3 New Min-cut/Max-flow Algorithm 

In this section we present a new algorithm that we developed while working with
graphs that are typical for energy minimization methods in computer vision. The
algorithm presented here belongs to the group of algorithms based on augmenting
paths. Similarly to DINIC, it builds a search tree for finding augmenting
paths, but it reuses this tree and never starts building it from scratch. The drawback
of our approach is that the augmenting paths found are not necessarily
shortest augmenting paths; thus the time complexity bound of shortest augmenting
path algorithms no longer applies. The trivial upper bound on the number of augmentations
for our algorithm is the cost of the minimum cut $|C|$, which results in the worst
case complexity $O(mn^2|C|)$. Theoretically speaking, this is worse than the complexities
of the standard algorithms discussed in Section 2.2. However, the experimental
comparison in Section 4 shows that on typical problem instances in vision our
algorithm significantly outperforms the standard algorithms.

3.1 Algorithm’s Overview 

We maintain a search tree S with the source as a root, where all edges from each
parent node to its children are non-saturated. The nodes that are not in S are
called "free". The set of free nodes is denoted T. The nodes in the search tree S
are divided into "active" and "passive". The active nodes may "grow", that is,
they may acquire new children from the set of free nodes. The passive nodes are
guaranteed to have no free neighbors connected through non-saturated edges.
Thus, the passive nodes cannot grow.

The algorithm iteratively repeats the following three stages:

- "growth" stage: the search tree grows until the sink is found.
- "augmentation" stage: the path found is augmented, and the search tree is broken
  into a forest.
- "adoption" stage: the forest is transformed back into a tree.

At the growth stage the search tree expands. The active nodes acquire new children
from the set of free nodes. The newly acquired nodes become active members
of the search tree S. As soon as all neighbors of a given active node are explored,
the active node becomes passive. The growth stage terminates when the sink is
encountered and, thus, a path from the source to the sink is found.

The augmentation stage augments the path found in the growth stage. Since
we push through the largest flow possible, some edges in the path become saturated.
Thus, some of the nodes in the tree become "orphans", that is, the edges
linking them to their parents are no longer valid (they are saturated). In fact,
the augmentation phase splits the search tree S into a forest. The source is still
a root of one of the trees in the forest and the orphans form the roots of the other trees.

The goal of the adoption stage is to restore a single search tree structure with
a root in the source. At this stage we try to find a new valid parent for each
orphan. If there is no such parent we remove the orphan from S and make it a
free node. We also declare all its former children orphans. The stage terminates
when no orphans are left and, thus, the search tree structure of S is restored.
Since some orphan nodes in S may become free, the adoption stage results in a
contraction of the set S.

After the adoption stage is completed the algorithm returns to the growth
stage. The algorithm terminates when the search tree cannot grow (all active
nodes have checked their neighbors and become passive) while the sink is not found.



3.2 Details of Implementation 

Assume that we are given a directed graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$. As for any augmenting
path algorithm, we will maintain a flow f and the residual graph $\mathcal{G}_f$ (see Section 2.2).
For each node p we will store its parent as PARENT(p). Roots of the
forest (the source and the orphans) as well as all free nodes have no parents,
i.e. PARENT(p) = ∅. We will also keep the lists of all active nodes, A, and all
orphans, O. The general structure of the algorithm is:

    initialize: S = A = {s}, T = V − {s}, O = ∅
    while true
        grow S to find an augmenting path P from s to t
        if P = ∅ terminate
        augment on P
        adopt orphans
    end while

The details of the growth, augmentation, and adoption stages are described below.



Growth Stage: At this stage active nodes acquire new children from the set of
free nodes.

    if t ∈ S return P = PATH_{s→t}
    while A ≠ ∅
        pick an active node p ∈ A
        for every non-saturated edge (p, q)
            if q ∈ T add q to the search tree as an active node:
                S := S ∪ {q}, A := A ∪ {q}, PARENT(q) := p
            if q = t return P = PATH_{s→t}
        end for
        remove p from A
    end while
    return P = ∅

Augmentation Stage: The input for this stage is a path P from s to t. Note
that the orphan set is empty at the beginning of the stage, but there will be
some orphans at the end since at least one edge in P becomes saturated.

    find the bottleneck capacity Δ on P
    update the residual graph by pushing flow Δ through P
    for each edge (p, q) in P that becomes saturated
        set PARENT(q) := ∅
        add q to O
    end for

Adoption Stage: During this stage all nodes in O are processed until O becomes
empty. The node being processed tries to find a new parent in S; in case
of success it remains in S but with a new parent, otherwise it is moved from
S to the set of free nodes T and all its children are added to O.

    while O ≠ ∅
        pick a node p ∈ O
        remove p from O
        process p
    end while

The operation “process p” consists of the following steps. First we try
to find a new parent for p. For each non-saturated edge (q,p) entering p we check
whether q is a valid parent. Two conditions should hold for q:

- q should be in S

- the “origin” of q should be the source

Note that it is necessary to check the second condition because some of the nodes
in S originate from orphans.

If a new parent q is found, then p remains in S with q as its parent. The
active (or passive) status of p in S remains unchanged. If p does not find a valid
parent in S then the following three operations are performed:

- p is removed from S (and A) and becomes a free node in T

- for all children q of p we set PARENT(q) := ∅ and add them to the set of
orphans O

- all “potential” parents of p (nodes q in S such that the edge (q,p) is not
saturated) are added to the active set A

The last operation is necessary to make sure that no passive node in S connects 
to a free neighbor through a non-saturated edge. Only active nodes are allowed 
to have such free neighbors. Suppose that an orphan p becomes free. Without 
the last operation, the passive neighbors of p in S connected to p via non- 
saturated edges would remain passive while they should not. At that moment 
these neighbors did not qualify as valid parents for p because they originated 
from other orphans and not from the source. After the search tree is fixed one 
of such neighbors may potentially become a new parent of p. 
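
The growth, augmentation, and adoption stages described above can be sketched in Python as follows. This is a minimal, unoptimized illustration and not the implementation used in the experiments. It assumes a graph given as a dict of dicts of capacities, processes active nodes and orphans in FIFO order, and removes freed nodes from the active set lazily. The names `max_flow`, `grow`, `augment`, `adopt`, and `origin_is_source` are hypothetical.

```python
from collections import deque

def max_flow(cap, s, t):
    """Return (max-flow value, source side S of a minimum s/t-cut)."""
    # Residual capacities; each edge also gets a reverse edge (0 if absent).
    res = {s: {}, t: {}}
    for u, nbrs in cap.items():
        for v, c in nbrs.items():
            res.setdefault(u, {})
            res.setdefault(v, {}).setdefault(u, 0)
            res[u][v] = res[u].get(v, 0) + c
    parent = {s: None}    # the keys of `parent` are exactly the nodes in S
    active = deque([s])   # active set A; stale entries are skipped lazily
    orphans = deque()     # orphan set O

    def path_to_sink():
        nodes = [t]
        while nodes[-1] != s:
            nodes.append(parent[nodes[-1]])
        return nodes[::-1]

    def origin_is_source(q):
        while parent[q] is not None:
            q = parent[q]
        return q == s

    def grow():
        if t in parent:
            return path_to_sink()
        while active:
            p = active[0]
            if p in parent:                      # skip nodes that became free
                for q, c in res[p].items():
                    if c > 0 and q not in parent:
                        parent[q] = p            # q joins S as an active node
                        active.append(q)
                        if q == t:
                            return path_to_sink()
            active.popleft()
        return None

    def augment(path):
        edges = list(zip(path, path[1:]))
        delta = min(res[u][v] for u, v in edges)  # bottleneck capacity
        for u, v in edges:
            res[u][v] -= delta
            res[v][u] += delta
            if res[u][v] == 0:                    # saturated: v becomes an orphan
                parent[v] = None
                orphans.append(v)
        return delta

    def adopt():
        while orphans:
            p = orphans.popleft()
            new_parent = next((q for q in res[p] if res[q][p] > 0
                               and q in parent and origin_is_source(q)), None)
            if new_parent is not None:
                parent[p] = new_parent
                continue
            for q in res[p]:                      # p becomes a free node
                if q in parent and parent[q] == p:
                    parent[q] = None              # children become orphans
                    orphans.append(q)
                if q in parent and res[q][p] > 0:
                    active.append(q)              # potential parents turn active
            del parent[p]

    flow = 0
    while True:
        path = grow()
        if path is None:
            return flow, set(parent)
        flow += augment(path)
        adopt()
```

For example, `max_flow({'s': {'a': 3, 'b': 2}, 'a': {'b': 1, 't': 2}, 'b': {'t': 3}}, 's', 't')` returns the flow value together with the node set S on the source side of the cut. The order in which active nodes and orphans are picked is a free design choice in the algorithm; FIFO is used here only for simplicity.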



3.3 Correctness Proof 

Let us introduce some invariants that are maintained during the execution of
the algorithm.

I1 S is a forest with roots at either the source or orphans.

I2 Edges from a parent to children in the search forest have nonzero residual
capacities.

I3 There are no orphans during the growth stage.

I4 For passive nodes p in S the following property should be true: for all non-
saturated edges (p, q) the node q must belong to S.

These invariants clearly hold at the initialization of the algorithm, and it is
easy to see that each stage preserves them by construction.

Let’s show that all stages terminate. The growth stage terminates because 
the number of nodes is finite. The same argument applies to the augmentation 
stage. Now we prove that the adoption stage is also finite. Note that after a node 
p in O has been processed it can not become an orphan again during the same 
adoption stage (it will imply that the adoption stage terminates after processing 
at most n nodes). Indeed, if p is moved from S to T then this holds since free
nodes in T are not involved at the adoption stage. Suppose p found a new parent 
g and remained in S. The new parent g must originate from the source. Thus, 
the source is the new origin of p as well. By construction, only descendants of 
orphans may become orphans during the adoption stage. Therefore, p can not 
become an orphan again at the same adoption stage. 

The algorithm terminates if the number of cycles (augmentations) is finite. 
Since the algorithm is not a shortest path algorithm the polynomial bound for 
the number of augmentations does not seem to be valid. We know only a trivial 
bound given by a minimum cut cost that works if all edge weights are integers. 

It remains to show that when the algorithm terminates it generates the maximum
flow. In fact, the search tree S and the set of free nodes T at the end of
the algorithm give a minimum s/t-cut. Suppose the algorithm has terminated.
It could only have happened in the growth stage when no active nodes were left
and t ∉ S. S and T are disjoint sets such that S ∪ T = V, s ∈ S, and t ∈ T.
Suppose that the current residual graph contains a non-saturated path from the
source to the sink that can be used to increase the flow. Then there is a non-
saturated edge (p, q) going from a node p ∈ S to another node q ∈ T. Since no
active nodes are left, p is passive. Hence, the invariant I4 does not hold for
p and we get a contradiction.

4 Experimental Tests on Applications in Vision 

In this section we experimentally test min-cut/max-flow algorithms for three 
different applications in computer vision: image restoration (Section 4.1), stereo 
(Section 4.2), and object segmentation (Section 4.3). We chose formulations 
where certain appropriate versions of energy (1) can be minimized via graph cuts. 
The corresponding graph structures were previously described by [10,12,2,4,14,3] 
in detail. These (or very similar) structures are used in all computer vision 
applications with graph cuts (that we are aware of) to date. 

Note that we could not test all known min-cut/max-flow algorithms. We 
compare our new algorithm presented in Section 3 and standard algorithms 
of combinatorial optimization introduced in Section 2.2: DINIC, H_PRF, and 
Q_PRF. Many experimental tests, including the results in [5], show that the 
last two algorithms work consistently better than a large number of other min- 
cut/max-flow algorithms of combinatorial optimization. For DINIC, H_PRF, and 
Q_PRF we took the implementations written by Cherkassky and Goldberg [5] 
and modified them to our graph representation. Both H_PRF and Q_PRF use 
global and gap relabeling heuristics. Our algorithm also leaves some choice in 
implementing certain functions. We found that the order of processing active
nodes and orphans may have a significant effect on the running time. We tuned
this order once and used the same setting in all experiments.



4.1 Image Restoration 

Here we consider two examples of energy (1) with the Potts and linear models 
of interaction. Graph based methods for minimizing Potts energy were used 
in many different applications including segmentation [13], stereo [2,4], object 
recognition [1], shape reconstruction [16], and augmented reality [17]. Linear 
interaction energy was used for stereo [15] and segmentation [12]. The structures 
of the corresponding graphs are identical in all applications using the same type 
of energy. We chose the context of image restoration mainly for its simplicity. 
The Potts energy that we use for image restoration is 

$$E(I) = \sum_{p \in \mathcal{P}} \|I_p - I_p^o\| + \sum_{(p,q) \in \mathcal{N}} K_{(p,q)} \cdot T(I_p \neq I_q) \tag{2}$$

where $I = \{I_p \mid p \in \mathcal{P}\}$ is a vector of unknown “true” intensities of pixels on
the image $\mathcal{P}$ and $I^o = \{I_p^o \mid p \in \mathcal{P}\}$ are intensities observed in the original image
corrupted by noise. The Potts interactions are specified by penalties $K_{(p,q)}$ for
intensity discontinuities between pairs of neighboring pixels. Function T(·) is 1
if the condition inside the parentheses is true and 0 otherwise. In the case of two
labels the Potts energy can be minimized exactly using the graph cut method of
Greig et al. [10]. We consider image restoration with multiple labels, where the
problem becomes NP-hard. We use the iterative graph-based method of [4], which
is guaranteed to find a solution within a factor of two of the global minimum
of the Potts energy. At each iteration the method of [4] computes a minimum cost
cut for a certain generalization of the graph introduced in [10].

Fig. 2. Image Restoration Experiments. (a) Diamond restoration. (b) Original Bell Quad.
(c) “Restored” Bell Quad. (d) Potts energy. (e) Linear interactions energy.

Our image restoration experiments with the Potts energy are presented in 
Figure 2(a-c). The sizes of our test images are 100 x 100 (Diamond) and 112 x 136 
(Bell Quad). The number of allowed labels is 215 and 256, respectively. The
running times (in seconds, 333MHz Pentium III) for the Potts energy minimiza- 
tion tests are given in Figure 2(d). These running times represent the first cycle 
of iterations (see [4] for more details). 

We also consider image restoration with “linear” interactions energy: 

$$E(I) = \sum_{p \in \mathcal{P}} \|I_p - I_p^o\| + \sum_{(p,q) \in \mathcal{N}} A_{(p,q)} \cdot |I_p - I_q| \tag{3}$$

where constants $A_{(p,q)}$ describe the relative importance of interactions between
neighboring pixels p and q. If the set of labels is finite and ordered then this
energy can be minimized exactly using either of the two almost identical graph-based
methods developed in [12,2]. In fact, both of them use graphs very similar
to the one introduced by [15] in the context of multi-camera stereo. These meth- 
ods build graphs by consecutively connecting multiple layers of image-grids. Each 
layer corresponds to one label. The structure of the graphs for linear interactions 
energy has one important distinction from the graphs that are currently used to 
minimize other types of energies; the two terminals are connected only to the 
first and the last layers of the graph. This distinction becomes more pronounced 
when the number of labels (layers) is large. Note that allocating computer mem- 
ory for such multi-layered graphs can be problematic even for 2D images. 
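
To make the two restoration energies concrete, the sketch below evaluates a Potts energy and a linear interactions energy for a candidate labeling of a small image. It is only an illustration under simplifying assumptions that the text does not state: scalar intensities, a 4-connected neighborhood system, and a single constant weight (`K` or `A`) shared by all neighboring pairs. The function names are hypothetical.

```python
def neighbor_pairs(h, w):
    """Yield every 4-connected pair of pixels of an h x w grid exactly once."""
    for i in range(h):
        for j in range(w):
            if i + 1 < h:
                yield (i, j), (i + 1, j)
            if j + 1 < w:
                yield (i, j), (i, j + 1)

def data_term(I, I_obs):
    """Sum over pixels of |I_p - I_p^o| (scalar intensities assumed)."""
    return sum(abs(a - b) for row, row_obs in zip(I, I_obs)
               for a, b in zip(row, row_obs))

def potts_energy(I, I_obs, K=1.0):
    """Data term plus a constant penalty K for each discontinuity."""
    h, w = len(I), len(I[0])
    jumps = sum(1 for (a, b), (c, d) in neighbor_pairs(h, w)
                if I[a][b] != I[c][d])
    return data_term(I, I_obs) + K * jumps

def linear_energy(I, I_obs, A=1.0):
    """Data term plus A times the absolute difference of neighboring labels."""
    h, w = len(I), len(I[0])
    diffs = sum(abs(I[a][b] - I[c][d])
                for (a, b), (c, d) in neighbor_pairs(h, w))
    return data_term(I, I_obs) + A * diffs
```

The only difference between the two functions is the interaction term: the Potts penalty is the same for any jump, while the linear penalty grows with the size of the jump.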

The table in Figure 2(e) shows how long it took each min-cut/max-flow
algorithm to compute the exact minimum of the linear interactions energy above.
We used the same Diamond and Bell Quad images as in the Potts energy tests.
In the tests presented in (e) we varied the number of labels (layers) $|\mathcal{L}|$. The
experiments show that our algorithm is the fastest when the number of labels
is relatively small (less than 50) while Q_PRF wins for larger numbers of labels.
Note that the number of labels affects the structure of the graphs in [15,12,2]. In 
the Potts energy minimization method in [4] the number of labels changes the 
number of iterations in each cycle but has no effect on the graph structures. 

4.2 Stereo with Occlusions 

Here we describe our tests on examples in stereo. We consider a recent formu- 
lation [14] that takes occlusions into consideration. The problem is formulated 
as a labeling problem. We want to assign a binary label (0 or 1) to each pair
(p, q) where p is a pixel in the left image and q is a pixel in the right image that
can potentially correspond to p. The set of pairs with the label 1 describes the
correspondence between the images. The energy of configuration f is given by

$$E(f) = \sum_{f_{(p,q)}=1} D_{(p,q)} + \sum_{p \in \mathcal{P}} C_p \cdot T(p \text{ is occluded in the configuration } f) + \sum_{\{(p,q),(p',q')\} \in \mathcal{N}} V_{(p,q),(p',q')} \cdot T(f_{(p,q)} \neq f_{(p',q')})$$

The first term is the data term, the second is the occlusion penalty, and the
third is the smoothness term. $\mathcal{P}$ is the set of pixels in both images, and $\mathcal{N}$ is
the neighborhood system consisting of tuples of neighboring pairs {(p,q), (p',q')}
having the same disparity (parallel pairs). [14] gives an approximate algorithm
minimizing this energy among all feasible configurations f. In contrast to other
energy minimization methods, nodes of the graph constructed in [14] represent
pairs rather than pixels or voxels.

The tests were done for three stereo examples shown in Figure 3. We used the 
Head pair from the University of Tsukuba, and the well-known Tree pair from 
SRI. To diversify our tests we compared the speed of algorithms on a Random 
pair where the left and the right images did not correspond to each other. 

Running times for stereo examples in Figure 3 are shown in seconds (450MHz 
UltraSPARC II Processor) in the table below. The times are for the first cycle 
of the algorithm, which is where most of the work is done. 
Fig. 3. Stereo Experiments. (a) Left image of Head pair. (b) Disparity map for Head
pair. (c) Left image of Tree pair. (d) Disparity map for Tree pair. (e) Left image of
Random pair. (f) Right image of Random pair. (g) Disparity map for Random pair.
The sizes of images are 384 x 288 in (a), 256 x 233 in (c), and 100 x 140 in (e,f).
The results in (b,d,g) show occluded pixels in black.
4.3 Interactive Object Segmentation

In this section we describe experimental tests that compare min-cut/max-flow
algorithms on the Interactive Graph Cuts segmentation technique of [3]. The method
in [3] allows for the segmentation of an object of interest in N-D images/volumes.
This technique generalizes the MAP-MRF method of Greig et al. [10] by
incorporating additional hard constraints into the minimization of the Potts energy

$$E(L) = \sum_{p \in \mathcal{P}} D_p(L_p) + \sum_{(p,q) \in \mathcal{N}} w_{(p,q)} \cdot T(L_p \neq L_q)$$

over binary (object/background) labelings of the image. The hard constraints come
from a user placing some object and background seeds. The technique computes
a binary segmentation of an N-dimensional image with globally optimal regional and
boundary properties among all segmentations that satisfy the hard constraints
(seeds). The details of the corresponding graph construction are given in [3].

We tested min-cut/max-flow algorithms on 2D and 3D segmentation exam- 
ples illustrated in Figure 4. We present the original data and the segmentation 
results corresponding to certain sets of seeds. Note that the user places seeds 
interactively. New seeds can be added to correct segmentation imperfections. 
The technique in [3] efficiently recomputes the optimal solution starting at the 
previous segmentation result. 

Figure 4(a-b) shows a photo-editing experiment on a picture (200x300 pixels)
with a group of people around a bell. Other segmentation examples in (c-h) are 
for 2D and 3D medical data. The cardiac MR data in (c-d) was tested in both 
2D (256x256 pixels) and 3D (256x256x13 voxels) cases. In our 3D experiment 
the seeds were placed in only one slice in the middle of the volume. This was 
enough to segment the whole volume “correctly”. The tests with lung CT data
(e-f) were also made in both 2D (512x512 pixels) and 3D (512x512x5 voxels) 
cases. In (g-h) we tested the algorithms on 2D liver MR data (512x256 pixels). 

The table below compares the running times (in seconds, 600MHz Pentium
III processor) of selected min-cut/max-flow algorithms for the segmentation
examples described. Note that these times include only the min-cut/max-flow
computation.¹ The tests on 3D data are marked by “3D”. To diversify our tests we
also made a few experiments where inconsistent seeds were placed at random
places in the image. The corresponding columns in the table are marked as
“random”. Meaningless segmentations from these tests are not shown in Figure 4.

¹ The time it takes the user to place the seeds varies and may depend on image quality,
object of interest, and the level of desired details. For the experiments in Figure 4
all seeds were placed within 10 to 40 seconds.

Fig. 4. Segmentation Experiments. Photo Editing: (a) Bell Photo, (b) Bell Segmentation.
Medical Data: (c) Cardiac MR, (d) LV Segment, (e) Lung CT, (f) Lobe Segment,
(g) Liver MR, (h) Liver Segment.



method | input
       | Liver | Bell | Lung | Heart (3D) | Lung (3D) | Bell (random) | Lung (random)
DINIC  | 26    | 7    | 5    | -          | -         | 8             | 16
H_PRF  | 3.5   | 2    | 2.5  | -          | -         | 4             | 3
Q_PRF  | 2.5   | 1    | 0.5  | 7          | 68        | 1.5           | 1.5
Our    | 0.26  | 0.26 | 0.16 | 2          | 20        | 1             | 2.5


Note that 3D segmentation required memory efficient implementations of
the graph cut algorithms. We made such implementations only for our new 
algorithm and for Q_PRF (which outperformed H_PRF and DINIC in most other 
experiments). H_PRF and DINIC were not tested in 3D segmentation examples. 

5 Conclusions 

We tested a reasonable sample of typical vision graphs. In most examples our 
new min-cut/max-flow algorithm worked 2-10 times faster than any of the other 
methods, including the push-relabel and Dinic algorithms (which are known to 
outperform other min-cut/max-flow techniques). In some cases the new algo- 
rithm made possible near real-time performance of the corresponding applica- 
tions. One noticeable exception was the energy with linear interactions (3). If 
the number of labels for (3) was relatively small (< 50) then our algorithm was 
only marginally the best, while Q_PRF was significantly faster for larger number 
of labels. We also found that our algorithm’s performance was roughly the same 
as Q_PRF in unrealistic examples with “random” inputs.

Our results also suggest that graphs in vision are a very specific application 
for min-cut/max-flow algorithms. In fact, Q_PRF outperformed H_PRF in most 
of our tests despite the fact that H_PRF is generally regarded as the fastest
algorithm in the combinatorial optimization community. Additional experiments showed
that our algorithm was several times slower than H_PRF on standard (outside 
computer vision) graphs that are used for tests in combinatorial optimization. 
Abstract. The 2D absolute phase estimation problem, in interferometric
applications, is to infer absolute phase (not simply modulo-2π) from
incomplete, noisy, and modulo-2π image observations. This is known
to be a hard problem as the observation mechanism is nonlinear. In
this paper we adopt the Bayesian approach. The observation density is
2π-periodic and accounts for the observation noise; the a priori
probability of the absolute phase is modeled by a first order noncausal Gauss
Markov random field (GMRF) tailored to smooth absolute phase images.
We propose an iterative scheme for the computation of the maximum a
posteriori probability (MAP) estimate. Each iteration embodies a discrete
optimization step (Z-step), implemented by network programming
techniques, and an iterated conditional modes (ICM) step (π-step).
Accordingly, we name the algorithm ZπM, where letter M stands for
maximization. A set of experimental results, comparing the proposed
algorithm with other techniques, illustrates the effectiveness of the proposed
method.



1 Introduction 

In many classes of imaging techniques involving wave propagation, there is a need
to estimate absolute phase from incomplete, noisy, and modulo-2π observations,
as the absolute phase is related to some physical entity of interest. Some
relevant examples are [1] synthetic aperture radar, synthetic aperture sonar,
magnetic resonance imaging systems, optical interferometry, and diffraction
tomography.

In all the applications referred to above, the observed data relates to the
absolute phase in a nonlinear and noisy way; the nonlinearity is sinusoidal and
it is closely related to the wave propagation phenomena involved in the
acquisition process; noise is introduced both by the acquisition process and by
the electronic equipment. Therefore, the absolute phase should be inferred
(unwrapped in the interferometric jargon) from noisy and modulo-2π observations
(the so-called principal phase values or interferogram).

* This work was supported by the Fundação para a Ciência e Tecnologia, under the
project POSI/34071/CPS/2000.



M.A.T. Figueiredo, J. Zerubia, A.K. Jain (Eds.): EMMCVPR 2001, LNCS 2134, pp. 375-390, 2001. 
(c) Springer-Verlag Berlin Heidelberg 2001
Broadly speaking, absolute phase estimation methods can be classified into
four major classes: path following methods, minimum-norm methods, Bayesian 
and regularization methods, and parametric models. Thesis [2] and paper [3] 
provide a comprehensive account of these methods. 

The mainstream of absolute phase estimation research in interferometry takes
a two step approach: in the first step, a filtered interferogram is inferred from
noisy images; in the second step, the phase is unwrapped by determining the 2π
multiples. Path following and minimum-norm schemes are representative of this
approach (see [1] for a comprehensive description of these methods). The main
drawback of these methods is that the filtering process destroys the modulo-2π
information in areas of high phase rate.

In a quite different vein, and recognizing that the absolute phase estimation 
is an ill-posed problem, papers [4], [5], [6], [7] have adopted the regularization 
framework to impose smoothness on the solution. The same objective has been 
pursued in papers [8], [9], [10], [11] by adopting a Bayesian viewpoint. Papers 
[8], [9] apply a nonlinear recursive filtering technique to determine the absolute 
phase. Paper [10] considers an InSAR (interferometric synthetic aperture radar) 
observation model taking into account not only the image absolute phase, but 
also the backscattering coefficient and the correlation factor images, which are
jointly recovered from InSAR image pairs. Paper [11] proposes a fractal based 
prior and the simulated annealing scheme to compute the absolute phase image. 

Parametric models constrain the absolute phase to belong to a given para- 
metric model. Works [12], [13] have adopted low order polynomials. These
approaches yield good results if the low order polynomials represent accurately
the absolute phase. However, in practical applications the entire phase func-
tion cannot be approximated by a single 2-D polynomial model. To circumvent 
model mismatches, work [12] proposes a partition of the observed field where 
each partition element has its own model. 



1.1 Proposed Approach 

We adopt the Bayesian viewpoint. The likelihood function, which models the 
observation mechanism given the absolute phase, is 2π-periodic and accounts
for the interferometric noise. The a priori probability of the absolute phase is 
modeled by a first order noncausal Gauss Markov random field (GMRF) [14], 
[15] tailored to smooth fields. 

Papers [8], [9], [10] have also followed a Bayesian approach to absolute phase
estimation. The prior used therein was a first order causal GMRF. Taking
advantage of this prior and using the reduced order model (ROM) [16] approximation
of the GMRF, the absolute phase was estimated with a nonlinear recursive filtering
technique. Compared with the present approach, the main difference concerns
the prior: we use a first order noncausal GMRF prior. In terms of estimation,
the noncausal prior implies a batch perspective, where the absolute phase
estimate at each site is based on the complete observed image. This is in
contrast with the recursive filtering technique [8], [9], [10], where the absolute phase
estimate of a given site is inferred only from past (in the lexicographic sense)
observed data.

To compute the MAP estimate, we propose an iterative procedure
with two steps per iteration: the first step, termed Z-step, maximizes the
posterior density with respect to the field of 2π phase multiples; the second
step, termed π-step, maximizes the posterior density with respect to the phase
principal values. The Z-step is a discrete optimization problem solved by network
programming techniques. The π-step is a continuous optimization problem solved
approximately by the iterated conditional modes (ICM) [17] scheme. We term
our algorithm ZπM, where the letter M stands for maximization.

The paper is organized as follows. Section 2 introduces the observation model,
the first order noncausal GMRF prior, and the posterior density. Section 3
elaborates on the estimation procedure. Namely, we derive solutions for the
Z-step and for the π-step. Section 4 presents results.

2 Adopted Models 

2.1 Observation Model 

The complex envelope of the signal read by the receiver from a given site is given
by

$$x = e^{j\phi} + n, \tag{1}$$

where $\phi$ is the phase to be estimated and n is complex zero-mean circular
Gaussian noise. Model (1), adopted in papers [8] and [9], applies, for example, to laser
interferometry [18].

Defining $\sigma_n^2 = E[|n|^2]$, the probability density function¹ of x is (see, e.g., [19,
ch. 3])

$$p_{x|\phi}(x|\phi) = \frac{1}{\pi\sigma_n^2} \exp\left\{-\frac{|x - e^{j\phi}|^2}{\sigma_n^2}\right\}. \tag{2}$$

Developing the quadratic form in (2), one is led to

$$p_{x|\phi}(x|\phi) = c\, e^{\lambda \cos(\phi - \eta)}, \tag{3}$$

where $c = c(x, \sigma_n)$ and

$$\eta = \arg(x) \tag{4}$$

$$\lambda = \frac{2|x|}{\sigma_n^2}. \tag{5}$$

¹ For compactness, lowercase letters will denote random variables and their values as
well.

The likelihood function $p_{x|\phi}(x|\phi)$ is 2π-periodic with respect to $\phi$ with
maxima at $\phi = 2\pi k + \eta$, for $k \in \mathbb{Z}$ ($\mathbb{Z}$ denotes the integer set). Thus $\eta$ is a maximum
likelihood estimate of $\phi$. The peakiness of the maxima of (3), controlled by
parameter $\lambda$, is an indication of how trustworthy the data is.
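
The step from the Gaussian density to the cosine form can be checked numerically. The sketch below assumes a unit-amplitude model x = e^{jφ} + n with noise power σ_n², which is how the garbled formulas above are read here; under that assumption η = arg(x), λ = 2|x|/σ_n², and c = exp(-(|x|² + 1)/σ_n²)/(πσ_n²). The function names are hypothetical.

```python
import cmath
import math

def likelihood_direct(x, phi, sigma2):
    """Complex circular Gaussian density of x given phi (unit amplitude assumed)."""
    return math.exp(-abs(x - cmath.exp(1j * phi)) ** 2 / sigma2) / (math.pi * sigma2)

def likelihood_params(x, sigma2):
    """Return (c, lam, eta) of the form c * exp(lam * cos(phi - eta))."""
    eta = cmath.phase(x)                     # maximum likelihood phase estimate
    lam = 2.0 * abs(x) / sigma2              # controls the peakiness of the maxima
    c = math.exp(-(abs(x) ** 2 + 1.0) / sigma2) / (math.pi * sigma2)
    return c, lam, eta

def likelihood_cosine(x, phi, sigma2):
    """Same density written in the 2*pi-periodic cosine form."""
    c, lam, eta = likelihood_params(x, sigma2)
    return c * math.exp(lam * math.cos(phi - eta))
```

Both functions return the same value for any φ, and the second makes the 2π-periodicity explicit: shifting φ by any multiple of 2π leaves the likelihood unchanged, which is why the observations alone cannot determine the wrap count.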

The observation model (1) does not apply to applications exhibiting speckle 
noise such as synthetic aperture radar and synthetic aperture sonar. We have
shown in [10], however, that the observation model of these applications leads 
to an observation density with the same formal structure given by formula (3). 

Let $\phi = \{\phi_{ij} \mid (i,j) \in Z\}$ and $x = \{x_{ij} \mid (i,j) \in Z\}$ denote the absolute phase
and complex amplitude associated to sites $Z = \{(i,j) \mid i,j = 1, \dots, N\}$ (we
assume without loss of generality that images are square). Assuming that the
components of x are conditionally independent,

$$p_{x|\phi}(x|\phi) = \prod_{ij \in Z} p_{x_{ij}|\phi_{ij}}(x_{ij}|\phi_{ij}). \tag{6}$$

The conditional independence assumption is valid if the resolution cells as- 
sociated to any pair of pixels are disjoint. Usually this is a good approximation, 
since the point spread function of the imaging systems is only slightly larger than 
the corresponding inter-pixel distance (see [20]). 



2.2 Prior Model 

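Image $\phi$ is assumed to be smooth. Gauss-Markov random fields [14], [15] are both
mathematically and computationally suitable for representing local interactions,
namely to impose smoothness. We take the first order noncausal GMRF

$$p_\phi(\phi) \propto \exp\left\{-\frac{\mu}{2} \sum_{ij \in Z_1} \left[(\phi_{ij} - \phi_{i-1,j})^2 + (\phi_{ij} - \phi_{i,j-1})^2\right]\right\}, \tag{7}$$

where $Z_1 = \{(i,j) \mid i,j = 2, \dots, N\}$ and $\mu^{-1}$ means the variance of increments
$(\phi_{ij} - \phi_{i-1,j})$ and $(\phi_{ij} - \phi_{i,j-1})$.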



2.3 Posterior Density 

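Invoking the Bayes rule, we obtain the posterior probability density function of
$\phi$, given x, as

$$p_{\phi|x}(\phi|x) \propto p_{x|\phi}(x|\phi)\, p_\phi(\phi), \tag{8}$$

where the factors not depending on $\phi$ were discarded. Introducing (6) and (7)
into (8), we obtain

$$p_{\phi|x}(\phi|x) \propto \exp\left\{\sum_{ij \in Z} \lambda_{ij} \cos(\phi_{ij} - \eta_{ij}) - \frac{\mu}{2} \sum_{ij \in Z_1} \left[(\phi_{ij} - \phi_{i-1,j})^2 + (\phi_{ij} - \phi_{i,j-1})^2\right]\right\}. \tag{9}$$

The posterior distribution (9) is assumed to contain all information one needs
to compute the absolute phase estimate $\hat{\phi}$.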
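
As a concrete reading of the posterior, the sketch below evaluates its logarithm up to an additive constant for small N x N images stored as nested lists. It assumes the form reconstructed here: a data term Σ λ_ij cos(φ_ij - η_ij) over all sites, minus μ/2 times the sum of squared vertical and horizontal increments over sites with i, j ≥ 2 (indices 1..N-1 in the zero-based code). The function name is hypothetical.

```python
import math

def log_posterior(phi, lam, eta, mu):
    """Log of the posterior density, up to an additive constant."""
    n = len(phi)
    # Data term: periodic in each phi[i][j], maximal at eta[i][j] + 2*pi*k.
    data = sum(lam[i][j] * math.cos(phi[i][j] - eta[i][j])
               for i in range(n) for j in range(n))
    # Prior term: squared increments to the upper and left neighbors.
    rough = sum((phi[i][j] - phi[i - 1][j]) ** 2 + (phi[i][j] - phi[i][j - 1]) ** 2
                for i in range(1, n) for j in range(1, n))
    return data - 0.5 * mu * rough
```

The data term cannot distinguish φ from φ + 2πk; only the prior term does, which is what allows the wrap counts to be estimated.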
3 Estimation Procedure



The MAP criterion is adopted for computing $\hat{\phi}$. Accordingly,

$$\hat{\phi}_{MAP} = \arg\max_{\phi}\, p_{\phi|x}(\phi|x). \tag{10}$$

Due to the periodic structure of $p_{x|\phi}(x|\phi)$, computing the MAP solution leads
to a huge non-convex optimization problem, with an unbearable computational
burden. Instead of computing the exact estimate $\hat{\phi}_{MAP}$, we resort to a suboptimal
scheme that delivers nearly optimal estimates, with a far lower computational
load.

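Let the absolute phase $\phi_{ij}$ be uniquely decomposed as

$$\phi_{ij} = \psi_{ij} + 2\pi k_{ij}, \tag{11}$$

where $k_{ij} = \lfloor(\phi_{ij} + \pi)/(2\pi)\rfloor \in \mathbb{Z}$ is the so-called wrap-count component of
$\phi_{ij}$ and $\psi_{ij} \in [-\pi, \pi[$ is the principal value of $\phi_{ij}$. The MAP estimate (10) can be
rewritten in terms of $\psi = \{\psi_{ij} \mid (i,j) \in Z\}$ and $k = \{k_{ij} \mid (i,j) \in Z\}$ as

$$(\hat{\psi}_{MAP}, \hat{k}_{MAP}) = \arg\max_{\psi,k}\, p_{\phi|x}(\psi + 2\pi k|x) \tag{12}$$

$$= \arg\max_{\psi} \left\{\max_{k}\, p_{\phi|x}(\psi + 2\pi k|x)\right\}. \tag{13}$$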
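
The decomposition of an absolute phase into its principal value and wrap count can be sketched for a single site as follows. The function name `decompose` is hypothetical; the formula is the floor expression given in the text.

```python
import math

def decompose(phi):
    """Split phi into (psi, k) with phi = psi + 2*pi*k and psi in [-pi, pi)."""
    k = math.floor((phi + math.pi) / (2.0 * math.pi))   # wrap count
    psi = phi - 2.0 * math.pi * k                       # principal value
    return psi, k
```

For instance, a phase of 4.5 rad lies in the first wrapped interval above [-π, π), so its wrap count is 1 and its principal value is 4.5 - 2π. Phase unwrapping is the inverse task: recovering k when only a noisy version of ψ is observed.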



Instead of computing (13), we propose a procedure that successively and
iteratively maximizes $p_{\phi|x}(\psi + 2\pi k|x)$ with respect to $k \in \mathbb{Z}^{N^2}$ and $\psi \in [-\pi, \pi[^{N^2}$.
We term this maximization on the sets $\mathbb{Z}$ and $[-\pi, \pi[$ the ZπM algorithm;
Fig. 1 shows the corresponding pseudo-code.



Initialization: $\hat{\psi}^{(0)} = \eta$
For t = 1, 2, ...,
    Unwrapping step:
        $\hat{k}^{(t)} = \arg\max_{k}\, p_{\phi|x}(\hat{\psi}^{(t-1)} + 2\pi k|x)$   (14)
    Smoothing step:
        $\hat{\psi}^{(t)} = \arg\max_{\psi}\, p_{\phi|x}(\psi + 2\pi \hat{k}^{(t)}|x)$   (15)
    Termination test:
        If $|p_{\phi|x}(\hat{\phi}^{(t)}|x) - p_{\phi|x}(\hat{\phi}^{(t-1)}|x)| < \xi$
            break loop for

Fig. 1. ZπM Algorithm.



The ZπM algorithm is greedy, since the posterior density $p_{\phi|x}(\phi|x)$ cannot
decrease in any step of any iteration. Thus, the stationary points of the
couple (14)-(15) correspond to local maxima of $p_{\phi|x}(\phi|x)$. Nevertheless, the
proposed method yields systematically good results, as we will show in the next section.

The unwrapping step (14) finds the maximum of the posterior density
$p_{\phi|x}(\phi|x)$ on a mesh obtained by discretizing each coordinate $\phi_{ij}$ according to
(11). The first estimate delivered by the unwrapping step is based on the
maximum likelihood estimate $\eta = \{\eta_{ij} \mid (i,j) \in Z\}$. Smoothing is implemented
by the π-step (15). This is in contrast with the scheme followed by most phase
unwrapping algorithms, where the phase is estimated on the basis of a smooth
version of $\eta$, under the assumption that the phase $\phi$ is constant within windows
of given size. This assumption leads to strong errors in areas of high phase rate.



3.1 Z-Step 

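Since the logarithm is strictly increasing and $\cos(\psi_{ij} + 2\pi k_{ij} - \eta_{ij})$ does not
depend on $k_{ij}$, solving the maximization step (14) is equivalent to solving

$$\hat{k} = \arg\min_{k} E(k|\psi), \tag{16}$$

where the energy $E(k|\psi)$ is given by

$$E(k|\psi) = \sum_{ij \in Z_1} (\Delta\phi^h_{ij})^2 + (\Delta\phi^v_{ij})^2 \tag{17}$$

with

$$\Delta\phi^h_{ij} = 2\pi(k_{ij} - k_{i,j-1}) - \Delta\psi^h_{ij} \tag{18}$$

$$\Delta\phi^v_{ij} = 2\pi(k_{ij} - k_{i-1,j}) - \Delta\psi^v_{ij} \tag{19}$$

and $\Delta\psi^h_{ij} = \psi_{i,j-1} - \psi_{ij}$ and $\Delta\psi^v_{ij} = \psi_{i-1,j} - \psi_{ij}$.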

A simple but lengthy manipulation of equation (17) allows us to write 

$$\hat{k} = \arg\min_{k}\, (k - k_0)^T A (k - k_0), \tag{20}$$

where the column vector k is the column by column stacking of matrix k, matrix
A is nonnegative block Toeplitz and symmetric, and vector $k_0$ depends on
$\Delta\psi^h_{ij}$ and $\Delta\psi^v_{ij}$. For nonnegative symmetric matrices A, the integer least squares
problem (20) is known as the nearest lattice vector problem and it is NP-hard
[21]. It arises, for example, in highly accurate positioning by the Global Positioning
System (GPS) [22], [23]. Works [24], [21], [22] propose suboptimal polynomial
time algorithms for finding an approximately nearest lattice solution.

In our case, energy $E(k|\psi)$ is a sum of quadratic functions of $(k_{ij} - k_{i-1,j})$
and $(k_{ij} - k_{i,j-1})$. This is a special case of a nearest lattice vector problem, for
which we propose a network programming algorithm that finds the exact solution
in polynomial time. The algorithm is inspired by Flynn's minimum discontinuity
approach [25], which minimizes the sum of $|\lfloor \Delta\phi^h_{ij} + \pi \rfloor|$ and $|\lfloor \Delta\phi^v_{ij} + \pi \rfloor|$,
where $\lfloor x \rfloor$ denotes the highest integer not greater than x. Flynn's objective function
is, therefore, quite different from ours. However, both objective functions are
the sum of first order clique potentials depending only on $\Delta\phi^h_{ij}$ and $\Delta\phi^v_{ij}$. This
structural similarity allows us to adapt Flynn's ideas to our problem.

A Discrete/Continuous Minimization Method

The following lemma assures that if the minimum of $E(k|\psi)$ is not yet
reached, then there exists a binary image $\delta k$ (i.e., the elements of $\delta k$ are all
0 or 1) such that $E(k + \delta k|\psi) < E(k|\psi)$.

Lemma 1 Let $k_1$ and $k_2$ be two wrap-count images such that

$$E(k_2|\psi) < E(k_1|\psi). \tag{21}$$

Then, there exists a binary image $\delta k$ such that

$$E(k_1 + \delta k|\psi) < E(k_1|\psi). \tag{22}$$

Proof. See [26].

According to Lemma 1, we can iteratively compute $k_i = k_{i-1} + \delta k$, where
$\delta k \in \{0, 1\}^{N^2}$ minimizes $E(k_{i-1} + \delta k|\psi)$, until the minimum energy is reached.
Each minimization is a discrete optimization problem that can be exactly solved
in polynomial time by using network programming techniques such as maximum
flow [27] or minimum cut [28]. We note however that, in the iterative scheme just
described, it is not necessary to compute the exact minimizer of $E(k_{i-1} + \delta k|\psi)$
with respect to $\delta k$, but only a binary image $\delta k$ that decreases $E(k_{i-1} + \delta k|\psi)$.
Based on this fact we propose an efficient algorithm that iteratively searches for
improving binary images $\delta k$.
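
The iteration suggested by Lemma 1 can be illustrated on a tiny image. The sketch below is not the network programming algorithm of the paper: it replaces the search for an improving binary image by brute-force enumeration of all binary increments, which is feasible only for a handful of sites. The energy follows the quadratic form of the Z-step, with sums over sites having both indices at least 2 (indices 1..N-1 in the zero-based code). The names `energy`, `improve`, and `z_step` are hypothetical.

```python
import itertools
import math

def energy(k, psi):
    """E(k|psi): squared horizontal and vertical increments of psi + 2*pi*k."""
    n = len(k)
    total = 0.0
    for i in range(1, n):
        for j in range(1, n):
            dh = 2 * math.pi * (k[i][j] - k[i][j - 1]) - (psi[i][j - 1] - psi[i][j])
            dv = 2 * math.pi * (k[i][j] - k[i - 1][j]) - (psi[i - 1][j] - psi[i][j])
            total += dh * dh + dv * dv
    return total

def improve(k, psi):
    """Return k + dk for the best strictly improving binary dk, or None."""
    n = len(k)
    best, best_e = None, energy(k, psi) - 1e-9
    for bits in itertools.product((0, 1), repeat=n * n):   # brute force
        cand = [[k[i][j] + bits[i * n + j] for j in range(n)] for i in range(n)]
        e = energy(cand, psi)
        if e < best_e:
            best, best_e = cand, e
    return best

def z_step(psi):
    """Add improving binary images until none exists (minimum by Lemma 1)."""
    n = len(psi)
    k = [[0] * n for _ in range(n)]
    while True:
        nxt = improve(k, psi)
        if nxt is None:
            return k
        k = nxt
```

On a wrapped linear ramp whose true increments are smaller than π, the loop stops at a wrap-count image with the same energy as the true one. The energy is unchanged by adding a constant to k, so the wrap counts are recovered only up to such a constant.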

The following lemma, presented and proved in the appendix of [25], assures
that if there exists an improving binary image $\delta k$ [i.e., $E(k + \delta k|\psi) < E(k|\psi)$],
then there exists another improving binary image $\delta l$ such that the sets $S_1(\delta l) =
\{(i,j) \in Z \mid \delta l_{ij} = 1\}$ and $S_0(\delta l) = \{(i,j) \in Z \mid \delta l_{ij} = 0\}$ are both connected
in the first order neighborhood sense; i.e., given two sites $s_1$ and $s_n$ of $S_1$ ($S_0$),
there exists a sequence of first order neighbors, all in $S_1$ ($S_0$), that begins in $s_1$
and ends in $s_n$. We call images $\delta l$ with this property binary partitions of Z.

Lemma 2 Suppose that there exists a binary image $\delta k$ such that

$$E(k + \delta k|\psi) < E(k|\psi).$$

Then there exists a binary partition of Z, $\delta l$, such that

$$E(k + \delta l|\psi) < E(k|\psi).$$

Proof. See Lemma 2 in the appendix of [25].

Flyn’s central idea is to search for improving binary partitions 51 [termed in [25] 
an elementary operation (EO)]. Once 51 is found the wrap-count image k is up- 
dated to k + 51. If no EO is possible then, according to Lemma 2, energy iJ(k|'j/>) 







Jose M.B. Dias and Jose M.N. Leitao 




Fig. 2. Auxiliary graph to implement Flynn's algorithm (squared nodes) interleaved
with phase sites (circled and crossed nodes). A leftward (rightward) edge indicates a
unit increment of the wrap-count below (above) the edge. A downward (upward) edge
indicates a unit increment of the wrap-count to the right (left) of the edge.



cannot be decreased by any binary image increment of the current argument $\mathbf{k}$.
Thus, by Lemma 1, $E(\mathbf{k} \mid \boldsymbol{\psi})$ has reached its minimum.

To check whether a given binary partition $\delta\mathbf{l}$ improves the energy, one has to
compute only those clique potentials of $E(\mathbf{k} \mid \boldsymbol{\psi})$ containing sites in both sets $S_1(\delta\mathbf{l})$
and $S_0(\delta\mathbf{l})$; i.e., one has to compute clique potentials of $E(\mathbf{k} \mid \boldsymbol{\psi})$ only along loops
(this is still true on the boundary of $Z$ by taking zero potentials). Flynn's
algorithm uses graph theory techniques to represent and generate EOs. Figure 2
shows an auxiliary graph, whose nodes are interleaved with the phase sites. The
edges indicate which wrap-counts are to be incremented: a leftward (rightward) edge
indicates a unit increment of the wrap-count below (above) the edge. A downward
(upward) edge indicates a unit increment of the wrap-count to the right (left) of
the edge. The algorithm works by creating and extending paths made of directed
edges. When a path is extended to form a loop, the algorithm performs an EO,
removes the loop from the collection of paths, and resumes the path extension.

Assume that the array of auxiliary nodes has indexes in the set $\{(i,j) \mid i,j =
1, \ldots, N + 1\}$. Define the cost of an edge $\delta V(i,j; i',j')$ between the first order
neighbors $(i',j')$ and $(i,j)$ as $E(\mathbf{k} \mid \boldsymbol{\psi}) - E(\mathbf{k} + \delta\mathbf{k} \mid \boldsymbol{\psi})$, where $\delta\mathbf{k}$ is the wrap-count
increment induced by the edge. With these definitions, and bearing in mind the
structure of $E(\mathbf{k} \mid \boldsymbol{\psi})$ [see (17)], we are led to

5V{i,j\i,j - 1) = -47r(7T -b 

- 1;LJ) = -47r(7T - 

5V{i - 1, j) = -47r(7T -b 

6V{i,j;i - 1, j) = -47r(7r - 

The values of boundary edges are defined to be zero; i.e., $\delta V(1, j) = \delta V(N + 1, j)
= \delta V(i, 1) = \delta V(i, N + 1) = 0$.

Figure 2 represents the state of the graph at a given instant. Assuming that
there are no loops, the set of edges defines a given number of trees. The value of
each node, $V(i,j)$, is the sum of edge values corresponding to the path between
the node and the tree root. In Figure 2 there are two trees. We stress that the
node values are real numbers, whereas in Flynn's algorithm they are integers.
The reason is that our energy $E(\mathbf{k} \mid \boldsymbol{\psi})$ takes values in the non-negative reals,
while Flynn's energy takes values in the positive integers.

The basic step of Flynn's algorithm is to revise the set of paths by adding a
new edge. An edge from $(i,j)$ to a first order neighbor $(i',j')$, if not already present,
is added if

$\Delta V > 0$.

If $\Delta V \le 0$, then the new path to $(i',j')$ would have a negative or zero value or
would fail to improve an existing path. If the edge is added, the set of paths is
revised in one of three possible ways (a minor modification of [25]): 1) edge
addition, 2) edge replacement, and 3) edge completion.
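The bookkeeping behind this test can be sketched as follows. The node value is computed exactly as described in the text (the sum of edge values along the path to the tree root). The expression used for $\Delta V$, namely the value of the source node plus the edge cost minus the current value of the target node, is an assumption made for illustration, since the defining formula is not legible in this copy. The names `node_value` and `improves` and the dictionary layout are hypothetical.

```python
from typing import Dict, Optional, Tuple

Node = Tuple[int, int]

def node_value(node: Node,
               parent: Dict[Node, Optional[Node]],
               edge_cost: Dict[Tuple[Node, Node], float]) -> float:
    """Sum of edge values on the path from the tree root down to `node`.

    Nodes missing from `parent` (or mapped to None) are treated as roots
    with value zero.
    """
    total = 0.0
    while parent.get(node) is not None:
        p = parent[node]
        total += edge_cost[(p, node)]
        node = p
    return total

def improves(src: Node, dst: Node, cost: float,
             parent: Dict[Node, Optional[Node]],
             edge_cost: Dict[Tuple[Node, Node], float]) -> bool:
    """Assumed test: add edge src -> dst only if Delta V > 0."""
    delta_v = (node_value(src, parent, edge_cost) + cost
               - node_value(dst, parent, edge_cost))
    return delta_v > 0.0

# A small tree: (1,1) -> (1,2) -> (2,2), with real-valued edge costs.
parent = {(1, 1): None, (1, 2): (1, 1), (2, 2): (1, 2)}
edge_cost = {((1, 1), (1, 2)): 1.5, ((1, 2), (2, 2)): -0.5}
```

Because the edge costs are real numbers here, the comparison is a floating-point one; a practical implementation would probably need a small tolerance to avoid cycling on rounding noise.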

The dashed edges in Fig. 2 illustrate graph revisions of types 1, 2, and 3. For
a more detailed example, see Flynn's paper [25].

The algorithm alternates between type 1 and type 2 revisions until a loop
is found, and then performs a type 3 revision. If $\Delta V \le 0$ for every attempted
edge addition, then no loop completion is possible and, according to Lemma 2 and
Lemma 1, the algorithm terminates.

Flynn's algorithm [25] and Costantini's algorithm [29] are equivalent, as they
both minimize the $L^1$ norm. Costantini has shown that $L^1$ minimization is equivalent
to finding the minimum cost flow on a given directed network. Minimum cost
flow is a graph problem for which there exist efficient solutions (see, e.g., [30]).
We do not implement our Z-step using Costantini's solution because the graph
cannot be used with the $L^p$ norm for $p \neq 1$.

Another way to implement the Z-step might be the discrete optimization
scheme proposed in [31]. The authors of [31] claim that their approach,
based on the maximum flow algorithm applied to a suitable graph, minimizes
any energy function in which the smoothness term is convex and involves only
pairs of neighboring pixels. However, the graph for a given convex smoothness
function is not presented in [31].



3.2 Smoothing Step 

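The smoothing step (15) amounts to computing $\hat{\boldsymbol{\psi}}$ given by

$\hat{\boldsymbol{\psi}} = \arg\max_{\boldsymbol{\psi}} \sum_{ij} \lambda_{ij} \cos(\phi_{ij} - \eta_{ij}) - \frac{\mu}{2} \sum_{ij} \left[ (\Delta\phi^h_{ij})^2 + (\Delta\phi^v_{ij})^2 \right]$, (23)

where $\phi_{ij} = 2\pi k_{ij} + \psi_{ij}$. The function to be maximized in (23) is not concave due
to the terms $\lambda_{ij} \cos(\phi_{ij} - \eta_{ij})$. Computing $\hat{\boldsymbol{\psi}}$ is therefore a hard problem. Herein, we
adopt the ICM approach [14], which, in spite of being suboptimal, yields good
results for the problem at hand.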

ICM is a coordinatewise ascent technique where all coordinates are visited
according to a given schedule. After some simple algebraic manipulation of the
objective function (23), we conclude that its maximum with respect to $\psi_{ij}$ is
given by

$\hat{\psi}_{ij} = \arg\max_{\psi_{ij}} \left\{ \beta_{ij} \cos(\psi_{ij} - \eta_{ij}) - (\psi_{ij} - \bar{\psi}_{ij})^2 \right\}$, (24)



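where

$\beta_{ij} = \frac{\lambda_{ij}}{2\mu}$, (25)

$\bar{\psi}_{ij} = \bar{\phi}_{ij} - 2\pi k_{ij}$, (26)

$\bar{\phi}_{ij} = \frac{\phi_{i-1,j} + \phi_{i+1,j} + \phi_{i,j-1} + \phi_{i,j+1}}{4}$. (27)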



There are no closed form solutions for maximization (24), since it involves
transcendental and power functions. We compute $\hat{\psi}_{ij}$ using a simple two-resolution
numeric method. First we search $\hat{\psi}_{ij}$ in the set $\{\pi i/M \mid i = -M, \ldots, M - 1\}$.
Next we refine the search by using the set $\{\pi i_0/M + \pi i/M^2 \mid i = -M, \ldots, M - 1\}$,
where $\pi i_0/M$ is the result of the first search. We have used $M = 20$, which leads
to a maximum error of $\pi/(20)^2$.
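The two-resolution search can be sketched in a few lines. The objective used here, $\beta\cos(\psi - \eta) - (\psi - \bar{\psi})^2$, is the per-site form assumed for (24); the function names and argument names are hypothetical, and the sketch handles a single site rather than a full ICM sweep.

```python
import math

def objective(psi, beta, eta, psi_bar):
    """Assumed per-site objective: beta*cos(psi - eta) - (psi - psi_bar)^2."""
    return beta * math.cos(psi - eta) - (psi - psi_bar) ** 2

def two_resolution_argmax(beta, eta, psi_bar, M=20):
    """Coarse search on {pi*i/M}, then a fine search with step pi/M^2."""
    def score(p):
        return objective(p, beta, eta, psi_bar)

    # Coarse grid over [-pi, pi) with spacing pi/M.
    coarse = [math.pi * i / M for i in range(-M, M)]
    psi0 = max(coarse, key=score)

    # Fine grid around the coarse winner, spacing pi/M^2.
    fine = [psi0 + math.pi * i / M ** 2 for i in range(-M, M)]
    return max(fine, key=score)
```

Each site costs $4M$ objective evaluations (80 for $M = 20$), compared with $2M^2$ for a single-resolution grid of the same final spacing.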

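The phase estimate $\hat{\psi}_{ij}$ depends in a nonlinear way on the data $\eta_{ij}$ and on the mean
weighted phase $\bar{\psi}_{ij}$. The balance between these two components is controlled
by the parameter $\beta_{ij}$. Assuming that $|\psi_{ij} - \eta_{ij}| \ll \pi$, then $\cos(\psi_{ij} - \eta_{ij})$ is well
approximated by the quadratic form $1 - (\psi_{ij} - \eta_{ij})^2/2$, thus leading to the linear
approximation

$\hat{\psi}_{ij} \simeq \frac{\beta_{ij}\,\eta_{ij} + 2\,\bar{\psi}_{ij}}{\beta_{ij} + 2}$. (28)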
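The step from the quadratic approximation to (28) is short enough to write out. The derivation below assumes the per-site objective has the form $\beta\cos(\psi - \eta) - (\psi - \bar{\psi})^2$ (site indices dropped); under that assumption the stationarity condition reproduces the weighted average in (28).

```latex
% Assumed per-site objective: J(\psi) = \beta\cos(\psi-\eta) - (\psi-\bar\psi)^2
\begin{align*}
J(\psi) &\approx \beta\left[1 - \tfrac{1}{2}(\psi-\eta)^2\right] - (\psi-\bar\psi)^2, \\
\frac{dJ}{d\psi} &= -\beta(\psi-\eta) - 2(\psi-\bar\psi) = 0, \\
\hat\psi &= \frac{\beta\,\eta + 2\,\bar\psi}{\beta + 2}.
\end{align*}
```

The estimate is thus a convex combination of the observation $\eta$ and the neighborhood mean $\bar{\psi}$, with weights $\beta/(\beta+2)$ and $2/(\beta+2)$.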



Reintroducing (28) in the above condition, one gets -C 2n/{Ptj +2). If 

this condition is not met, the solution becomes highly nonlinear in $\eta_{ij}$ and $\bar{\psi}_{ij}$;
as $|\bar{\psi}_{ij} - \eta_{ij}|$ increases, at some point the phase $\hat{\psi}_{ij}$ becomes thresholded to $\pm\pi$,
being therefore independent of the observed data $\eta_{ij}$.

Concerning computational complexity, the Z-step is by far the most demanding
one, using a number of floating point operations very close to that of Flynn's
minimum discontinuity algorithm. Since the proposed scheme needs roughly four
Z-steps, it has approximately 4 times the complexity of Flynn's minimum discontinuity
algorithm. To our knowledge there is no formula for the complexity of Flynn's
algorithm (see remarks about complexity in [25]). Nevertheless, we have found,
empirically, a complexity of approximately $O(N^3)$ for the Z-step.



4 Experimental Results 

The algorithm derived in the previous sections is now applied to synthetic data. 

Figure 3 displays the interferogram (the $\boldsymbol{\eta} = \{\eta_{ij}\}$ image) generated according
to density (2) with noise variance $\sigma_n = 1.05$. The absolute phase image $\boldsymbol{\phi}$ is
a Gaussian elevation of height $14\pi$ rad and standard deviations $\sigma_i = 10$ and
$\sigma_j = 15$ pixels. The magnitude of the phase difference $\phi_{i,j+1} - \phi_{ij}$ takes the
maximum value of 2.5 and is greater than 2 in many sites. On the other hand,
a noise variance of $\sigma_n = 1.05$ implies a standard deviation of 0.91 for the maximum
likelihood estimate $\eta_{ij}$. This figure is computed from the density
of $\eta$ obtained from the joint density (2). Under these conditions, the task of
absolute phase estimation is extremely hard, as the interferogram exhibits a large
number of inconsistencies; i.e., the observed image $\boldsymbol{\eta}$ is not consistent with the
assumption of absolute phase differences less than $\pi$ in a large number of sites.
In the unwrapping jargon, the interferogram is said to have many residues.

Fig. 3. Interferogram ($\eta$-image) of a Gaussian elevation of height $14\pi$ rad and standard
deviations $\sigma_i = 10$ and $\sigma_j = 15$ pixels. The noise variance is $\sigma_n = 1.05$.

The smoothness parameter was set to $\mu = 1/0.8^2$, thus modelling phase
images with phase differences (horizontal and vertical) of standard deviation
0.8. This value is too large for most of the true absolute phase image $\boldsymbol{\phi}$ and too
small for sites in the neighborhood of sites $(i = -45, j = 50)$ and $(i = 55, j = 50)$
(where the magnitude of the phase difference has its largest value). Nevertheless,
the Z$\pi$M algorithm yields good results, as can be seen from Fig. 4; Fig. 4(a)
shows the phase estimate $\hat{\boldsymbol{\phi}}^{(1)}$ and Fig. 4(b) shows the phase estimate $\hat{\boldsymbol{\phi}}^{(10)}$.

Figure 5 plots the logarithm of the posterior density $\ln p_{\phi|x}(\hat{\boldsymbol{\phi}}^{(t)} \mid \mathbf{x})$ and the
norm of the estimation error $\|\hat{\boldsymbol{\phi}}^{(t)} - \boldsymbol{\phi}\|$ as a function of the
iteration $t$. The four non-integer ticks between two consecutive integers refer
to four consecutive ICM sweeps, implementing the $\pi$-step of the Z$\pi$M algorithm.
Notice that the largest increment in $\ln p_{\phi|x}(\hat{\boldsymbol{\phi}}^{(t)} \mid \mathbf{x})$ happens in both steps of the
first iteration. For $t > 2$ only the Z-step produces noticeable increments in the
posterior density. These increments are, however, possible due to the very small
increments produced by the smoothing step. For $t > 4$ there is practically no
improvement in the estimates.

To rank the Z$\pi$M algorithm, we have applied the following phase unwrapping
algorithms to the present problem:

- Path following type: Goldstein's branch cut (GBC) [32]; quality guided
(QG) [33], [34]; and mask cut (MC) [35]

- Minimum norm type: Flynn's minimum discontinuity (FMD) [25];
weighted least-square (WLS) [36], [37]; and $L^0$ norm (L0N) (see [1, ch. 5.5])

- Bayesian type: recursive nonlinear filters [9] and [10] (NLF).

Fig. 4. Phase estimate $\hat{\boldsymbol{\phi}}^{(t)}$: (a) $t = 1$ (first iteration); (b) $t = 10$ (tenth iteration).

Fig. 5. Evolution of the logarithm of the posterior density $\ln p_{\phi|x}(\hat{\boldsymbol{\phi}}^{(t)} \mid \mathbf{x})$ and of the
norm of the estimation error as a function of the iteration $t$. Z-steps coincide with
integers, whereas ICM sweeps implementing the $\pi$-step are assigned to the non-integer part
of $t$.

Path following and minimum norm algorithms were implemented with the code
supplied in the book [1], using the following settings: GBC (-dipole yes); QG and
MC (-mode min_var -tsize 3); and WLS (-mode min_var -tsize 3, -thresh yes).
We have used the unweighted versions of the FMD and L0N algorithms.

Table 1 displays the norm of the estimation error $\|\hat{\boldsymbol{\phi}} - \boldsymbol{\phi}\|$ for each of
the classic algorithms referred to above. Results in the left column are based on
the maximum likelihood estimate of $\eta$ given by (4), using a 3 x 3 rectangular
window. Results in the right column are based on the interferogram $\eta$ without
any smoothing. Apart from the proposed Z$\pi$M scheme, all the algorithms have
produced poor results, some of them catastrophic. The reasons depend on the
class of algorithms and are basically the following:

Table 1. Norm of the estimation errors of Z$\pi$M and other unwrapping algorithms.
The left column shows results based on the maximum likelihood estimate of $\eta$ using
a 3 x 3 rectangular window; the right column shows results based on the non-smooth $\eta$
given by (4).

Algorithm    Smooth η    Non-smooth η
ZπM          -           0.1
GBC          48.0        7.0
QG           10.0        2.2
MC           40.8        28.6
FMD          22.4        3.4
WLS          …           3.5
L0N          24.1        2.6
NLF          -           40.1

- in the path following and minimum norm methods, the noise filtering is the
first processing step and is disconnected from the phase unwrapping process.
The noise filtering assumes the phase to be constant within given windows.
In data sets such as the one at hand, this assumption is catastrophic, even when
using small windows. On the other hand, if the smoothing step is not applied,
even if the algorithm is able to infer most of the $2\pi$ multiples, the observation
noise is fully present in the estimated phase

- the recursive nonlinear approaches [9] and [10] fail basically because they
use only the past observed data, in the lexicographic sense, to infer the
absolute phase.

5 Concluding Remarks 

The paper presented an effective approach to absolute phase estimation in
interferometric applications. The Bayesian standpoint was adopted. The likelihood
function, which models the observation mechanism given the absolute phase, is
$2\pi$-periodic and accounts for interferometric noise. The a priori probability of the
absolute phase is a noncausal first order Gauss Markov random field (GMRF).

We proposed an iterative procedure, with two steps per iteration, for the
computation of the maximum a posteriori probability (MAP) estimate. The first
step, termed Z-step, maximizes the posterior density with respect to the $2\pi$
phase multiples; the second step, termed $\pi$-step, maximizes the posterior density
with respect to the phase principal values. The Z-step is a discrete optimization
problem solved exactly by network programming techniques inspired
by Flynn's minimum discontinuity algorithm [25]. The $\pi$-step is a continuous
optimization problem solved approximately by the iterated conditional modes (ICM)
procedure. We call the proposed algorithm Z$\pi$M, where the letter M stands for
maximization.

The Z$\pi$M algorithm, resulting from a Bayesian approach, accounts for the
observation noise in a model-based fashion. More specifically, the observation
mechanism takes into account electronic and decorrelation noises. This is a crucial
feature that underlies the advantage of the Z$\pi$M algorithm over path following
and minimum-norm schemes, mainly in regions where the phase rate is
close to $\pi$. In fact, these schemes split the absolute phase estimation problem
into two separate steps: in the first step the noise in the interferogram is filtered
by applying low-pass filtering; in the second step, termed phase unwrapping, the
$2\pi$ phase multiples are computed. For high phase rate regions, the application
of the first step makes it impossible to recover the absolute phase, as the principal
value estimates are of poor quality. This is in contrast with the Z$\pi$M algorithm,
where the first step, the Z-step, is an unwrapping applied over the observed
interferogram.

To evaluate the performance of the Z$\pi$M algorithm, a Gaussian shaped surface
with high phase rate and 0 dB signal-to-noise ratio was considered. We
have compared the computed estimates with those provided by the best path
following and minimum-norm schemes, namely Goldstein's branch cut, the
quality guided, Flynn's minimum discontinuity, the weighted least-square,
and the $L^p$ norm algorithms. The proposed algorithm yields good results, performing
better, and in some cases much better, than the techniques just referred to.

Concerning computational complexity, the Z$\pi$M algorithm takes, approximately, a
number of floating point operations proportional to the 1.5 power of the number
of pixels. By far, the Z-step is the most demanding one, using a number of floating
point operations very close to that of Flynn's minimum discontinuity algorithm.
Since the proposed scheme needs roughly four Z-steps, it has approximately 4
times the complexity of Flynn's minimum discontinuity algorithm.

Concerning future developments, we foresee the integration of the principal
phase values in the posterior density as a major research direction. If this goal
were attained, then the wrap-count image would be the only unknown of
the obtained posterior density and, most importantly, there would be no need
for iteration in estimating the wrap-count image. After obtaining this image,
the principal phase values could be obtained using the $\pi$-step of the Z$\pi$M
algorithm.

Abstract. This paper addresses the problem of minimizing an energy
function by means of a monotonic transformation. With an observation 
on global optimality of functions under such a transformation, we show 
that a simple and effective algorithm can be derived to search within 
possible regions containing the global optima. Numerical experiments are 
performed to compare this algorithm with one that does not incorporate 
transformed information using several benchmark problems. These re- 
sults are also compared to best known global search algorithms in the 
literature. In addition, the algorithm is shown to be useful for a class of 
neural network learning problems, which possess much larger parameter 
spaces. 



1 Introduction 

Typically, the overall performance of an application in computer vision, pattern 
recognition, and many other fields of machine intelligence can be described or 
approximated by a multivariate function, where the best solution is achieved 
when the function attains its global extremum. This function to be optimized, 
the so-called objective function or energy function, is usually formulated to be 
dependent on a certain number of state variables or parameters. Due to the com- 
plexity of physical systems, this objective function is very likely to be nonlinear 
with respect to its parameters. Depending on the number of parameters and 
the intrinsic problem nature, the task of locating the global extremum can be 
extremely difficult. This is due to the unsolvable nature of the general global
optimal value problem [17] and the omnipresence of local extrema, whose number
may increase exponentially with the size of the parameter vector. Furthermore,
in real world applications, there may be flat regions which can mislead
gradient-based search methods. Worse still, difficulties may arise when these gradients
differ by many orders of magnitude.

Because formulating a generic and practical characterization of global optimality
remains a most challenging task, the main research effort has been
focused on specially structured optimization problems and exhaustive means [6,
7]. While a multitude of ideas have been attempted in devising effective search
methods for global solutions via numerical means, the various approaches can
nevertheless be broadly classified as follows: (i) deterministic approaches, (ii)
probabilistic/heuristic approaches, and (iii) a mix of deterministic and probabilistic
methods. Conceptually, most deterministic methods adopt strategies that
are covering-based or grid-search-based, or that use an ingenious tunneling
mechanism or trajectory trace to search through the domain of interest for
the global solution [1, 7, 18]. Improvements to such strategies usually take the
form of reducing the computational requirements by narrowing down the search
progressively and discarding regions which are not likely to contain the global
optima [6, 7]. For probabilistic methods, various random search methodologies
which inherit the 'hill-climbing' capability are employed to locate the global
optima (see e.g. [2, 6]). Theoretically, probabilistic methods do not guarantee (this,
indeed, is a deterministic requirement) that global optimal solutions are reached
within finite search instances, even though asymptotic assurance can usually be
achieved. The third class of methods attempts to combine the merits of both
deterministic and probabilistic methods to achieve efficient means of attaining the
global optimal solution.

In this article, the following two issues are addressed: (i) to provide an obser- 
vation on global optimality of functions under monotonic transformation, and 
(ii) to illustrate the usefulness of such observation using a deterministic search 
procedure which is derived from the well-established nonlinear programming 
literature. We show that it is possible to make use of a structurally known 
transformation to ‘extract’ the required structural information from the func- 
tion of interest. In short, by a notion of “relative structural means for relative 
structural information”, we propose to use the monotonic transformation for 
characterization of global optimality. 

The paper is organized as follows: in the next section, we provide notations and
definitions related to the subject matter. In section 3, the necessary and sufficient
conditions for global optimality based on monotonic transformation are shown. 
Based on this, regions that contain the global optimal solutions are identified 
using a level set. In section 4, a global descent algorithm is derived to search 
within this level set for global optimal solutions. This is followed by section 5 
where we compare two settings of our search procedure numerically with one that 
does not use the proposed transformation. The results in terms of the number 
of function evaluations and standard CPU time required are also compared with 
best known global optimization algorithms in the literature. In section 6, we 
further illustrate the usefulness of the algorithm for neural network applications 
which represent systems of a larger scale. Finally, some concluding remarks are 
drawn. 



2 Notations And Definitions 

Unless otherwise stated, vector and matrix quantities are denoted using bold 
lowercase characters and bold uppercase characters respectively, to distinguish 
from scalar quantities. The superscript ‘T’ on the respective character is used to 
denote matrix transposition. 

2.1 Minimization problem

Let $\mathcal{D}$ be a $p$-dimensional compact subset of real space and $f$ be a continuous
function from $\mathcal{D}$ to $\mathcal{R}$. Since maximization of the function $f$ is equivalent to
minimization of $(-f)$, without loss of generality, we define the problem of global
optimization to be

$(\mathcal{P}): \quad \arg\min_{\boldsymbol{\theta}} f(\boldsymbol{\theta}), \quad \boldsymbol{\theta} \in \mathcal{D}$, (1)

assuming the existence of at least one solution (see Definition 1, defined
next).



2.2 Solution and solution set 

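In this context, we refer to the solution and solution set of a minimization problem
as:

Definition 1 Let $f$ be a function to be minimized from $\mathcal{D}$ to $\mathcal{R}$ where $\mathcal{D} \subset \mathcal{R}^p$
is non-empty and compact. A point $\boldsymbol{\theta} \in \mathcal{D}$ is called a feasible solution to the
minimization problem. If $\boldsymbol{\theta}^*_g \in \mathcal{D}$ and $f(\boldsymbol{\theta}) \ge f(\boldsymbol{\theta}^*_g)$ for each $\boldsymbol{\theta} \in \mathcal{D}$, then $\boldsymbol{\theta}^*_g$ is
called a global optimal solution (global minimum) to the problem. If $\boldsymbol{\theta}^* \in \mathcal{D}$ and
if there exists an $\varepsilon$-neighborhood $N_\varepsilon(\boldsymbol{\theta}^*)$ around $\boldsymbol{\theta}^*$ such that $f(\boldsymbol{\theta}) \ge f(\boldsymbol{\theta}^*)$ for
each $\boldsymbol{\theta} \in \mathcal{D} \cap N_\varepsilon(\boldsymbol{\theta}^*)$, then $\boldsymbol{\theta}^*$ is called a local optimal solution (local minimum).

The set which contains both local optimal solutions and global optimal solutions
is called a solution set (denoted by $\Theta^*$).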

3 Global Optimality 

3.1 Global optimality of functions under monotonic transformation 

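Denote $\mathcal{R}^+ = (0, \infty)$. Consider a strictly decreasing transformation $\phi$ on the
function to be minimized. Denote by $\phi^\gamma(\cdot)$ the transformation raised to
some power $\gamma \in \mathcal{R}$. We shall make use of the following observation for our
development.

Proposition 1. Let $f : \mathcal{D} \to \mathcal{R}$ be a continuous function where $\mathcal{D} \subset \mathcal{R}^p$ is
compact. Let $\phi : \mathcal{R} \to \mathcal{R}^+$ be a strictly decreasing function. Suppose $\boldsymbol{\theta}^* \in \mathcal{D}$.
Then $\boldsymbol{\theta}^*$ is a global minimizer of $f$ if and only if

$\lim_{\gamma \to \infty} \frac{\phi^\gamma(f(\boldsymbol{\theta}))}{\phi^\gamma(f(\boldsymbol{\theta}^*))} = 0 \quad \forall\, \boldsymbol{\theta} \in \mathcal{D},\ f(\boldsymbol{\theta}) \neq f(\boldsymbol{\theta}^*)$. (2)

Proof: This follows from the fact that $\phi^\gamma(f(\boldsymbol{\theta}))/\phi^\gamma(f(\boldsymbol{\theta}^*)) > 0$, so that
$\lim_{\gamma \to \infty} \phi^\gamma(f(\boldsymbol{\theta}))/\phi^\gamma(f(\boldsymbol{\theta}^*)) = 0$
is equivalent to $\phi(f(\boldsymbol{\theta}))/\phi(f(\boldsymbol{\theta}^*)) < 1$. ■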
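A quick numerical check makes the limit concrete. The sketch below assumes the transformation $\phi(v) = e^{-v}$, chosen only because it is positive and strictly decreasing, and uses two made-up function values, `f_star` for a global minimum and `f_local` for some other point; none of these names or values come from the paper.

```python
import math

def phi(v):
    """An assumed strictly decreasing, positive transformation."""
    return math.exp(-v)

def ratio(f_theta, f_ref, gamma):
    """phi^gamma(f_theta) / phi^gamma(f_ref)."""
    return (phi(f_theta) / phi(f_ref)) ** gamma

f_star = 0.0    # hypothetical global minimum value
f_local = 0.5   # hypothetical value at a non-optimal point

# Ratio relative to the global minimum shrinks towards zero as gamma grows.
ratios = [ratio(f_local, f_star, g) for g in (1, 4, 20, 100)]

# Using a non-optimal point as the reference makes the ratio blow up instead.
reverse = ratio(f_star, f_local, 20)
```

The ratio decays geometrically in $\gamma$ whenever $\phi(f(\theta)) < \phi(f(\theta^*))$, which for a strictly decreasing $\phi$ is the same as $f(\theta) > f(\theta^*)$.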

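Consider the solution set as defined by Definition 1. Then, using a more
structured convex transformation $\phi$, the following result is a direct consequence
of Proposition 1.

Proposition 2. Let $f : \mathcal{D} \to \mathcal{R}$ be a continuous function where $\mathcal{D} \subset \mathcal{R}^p$ is
compact. Let $\phi : \mathcal{R} \to \mathcal{R}^+$ be a strictly decreasing and convex function. Denote
by $\Theta^*$ the solution set given by Definition 1. Suppose $\boldsymbol{\theta}^* \in \Theta^*$. Then $\boldsymbol{\theta}^*$ is a
global minimizer of $f$ if and only if

$\lim_{\gamma \to \infty} \frac{\phi^\gamma(f(\boldsymbol{\theta}))}{\phi^\gamma(f(\boldsymbol{\theta}^*))} = 0 \quad \forall\, \boldsymbol{\theta} \in \Theta^*,\ f(\boldsymbol{\theta}) \neq f(\boldsymbol{\theta}^*)$. (3)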

Remark 1: Consider $x_\alpha, x_1, x_2 \in \mathcal{R}$ and suppose $x_\alpha = (1 - \alpha)x_1 + \alpha x_2$, $x_1 \le x_2$,
$0 \le \alpha \le 1$. Let $\phi(x_2) = \phi(x_1) - \delta$ for some $x_1 < x_2$ and $0 < \delta < \phi(x_1)$. Since
$\phi(\cdot)$ is strictly decreasing and convex, it follows that

$\phi^\gamma(x_\alpha) \le \left[ (1 - \alpha)\phi(x_1) + \alpha\phi(x_2) \right]^\gamma = \left[ \phi(x_1) - \alpha\delta \right]^\gamma, \quad 0 \le \alpha \le 1$, (4)

which implies that

$\frac{\phi^\gamma(x_\alpha)}{\phi^\gamma(x_1)} \le \left( 1 - \frac{\alpha\delta}{\phi(x_1)} \right)^\gamma, \quad 0 \le \alpha \le 1$. (5)

For any arbitrary $\delta \in (0, \phi(x_1))$, we see that the ratio of any transformed point
between $x_1$ and $x_2$ over the transformed $x_1$ (i.e. $\phi^\gamma(x_\alpha)/\phi^\gamma(x_1)$) is held under
a prescribed curve $(1 - \alpha\delta/\phi(x_1))^\gamma$ for $0 \le \alpha \le 1$. The result of such a
transformation is a structured magnifying effect preserving the relative orders of all
extrema. Supposing $x_1$ is the global minimizer, we see that for any $\gamma > 1$, its
neighborhood value $x_\alpha$ will be held $(1 - \alpha\delta/\phi(x_1))^\gamma$ times below the value of
$\phi^\gamma(x_1)$. We shall illustrate this with an example. □



3.2 An illustrative example 

Visually, the role of $\phi$ is to magnify the relative ordering of global minima and
local minima, but in a reverse sense. Notice that in the minimization case, a
small value of $f$ is mapped to a relatively much higher value of $\phi$ because $\phi$ is
strictly decreasing and convex. This effect on the relative ordering of minima is even
more noticeable when $\phi$ is raised to a high power $\gamma$.

Consider the 2-dimensional Rastrigin function [17] for minimization within
$[-1,1]^2$. Let $\phi^\gamma(f) = e^{-\gamma(f - f_L)}$, where $f_L \in \mathcal{R}$ denotes a reference function
value. It can be shown that this transformation function is convex and strictly
decreasing. In order to maintain similar scalings for the plots, we take $f_L = f^*$
($f^*$ is the global minimum value) in this illustration. The 3-D mesh plot for the
original Rastrigin function is shown in Fig. 1(a). Here we see that determining
a threshold level directly on $f$ which will differentiate the global minimum from
the other local minima is difficult.

The 3-D mesh plots for three versions of the transformed function using
$\gamma = 1, 4, 20$ are shown in Fig. 1(b), (c), (d) respectively. It is clear from Fig. 1
that transformation with a high $\gamma$ value has made the global minimum 'stand
out' from all other local minima. Effectively, the monotonically decreasing convex
transformation suppresses local minima to a greater extent than the global
minimum, hence allowing the global minimum to emerge among the omnipresent
local minima.
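The 'stand out' effect can be quantified without plotting. The sketch below evaluates a Rastrigin-type function on a grid over $[-1,1]^2$ and measures what fraction of the grid stays above a fixed cutting level after an exponential transformation. Two things are assumptions: the textbook Rastrigin form used here (the variant in [17] may differ), and the transformation $e^{-\gamma(f - f_L)}$ with $f_L$ set to the grid minimum. The names `rastrigin`, `transformed`, and `fraction_above` are hypothetical.

```python
import numpy as np

def rastrigin(x, y):
    """Textbook 2-D Rastrigin form (assumed; may differ from [17])."""
    return (20.0 + x ** 2 - 10.0 * np.cos(2 * np.pi * x)
            + y ** 2 - 10.0 * np.cos(2 * np.pi * y))

def transformed(f_vals, gamma, f_ref):
    """Assumed transformation exp(-gamma * (f - f_ref))."""
    return np.exp(-gamma * (f_vals - f_ref))

xs = np.linspace(-1.0, 1.0, 201)
X, Y = np.meshgrid(xs, xs)
F = rastrigin(X, Y)
f_star = F.min()  # reference value f_L taken as the global minimum on the grid

def fraction_above(gamma, level=0.5):
    """Fraction of grid points whose transformed value reaches `level`."""
    return float(np.mean(transformed(F, gamma, f_star) >= level))

fractions = {g: fraction_above(g) for g in (1, 4, 20)}
```

With this form the local minima sit roughly one unit above the global minimum, so their transformed values fall below the level 0.5 for every $\gamma \ge 1$, and the region above the level shrinks around the global minimizer as $\gamma$ grows.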



Fig. 1. Rastrigin function $f$: (a) before transformation, and (b)-(d) after transformation
using $\phi^\gamma(f) = e^{-\gamma(f - f_L)}$, $\gamma = 1, 4, 20$.



3.3 A sufficient condition 

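Lemma 1. Let $f : \mathcal{D} \to \mathcal{R}$ be a function to be minimized where $f$ is continuous
on the compact set $\mathcal{D}$. Let $\phi : \mathcal{R} \to \mathcal{R}^+$ be a strictly decreasing function. Denote
by $\Theta^*$ the solution set given by Definition 1. Let $0 < \eta < 1$ and suppose $\boldsymbol{\theta}^* \in \Theta^*$.
If there exists $\gamma_0 > 1$ such that

$\phi^\gamma(f(\boldsymbol{\theta})) \le \eta\,\phi^\gamma(f(\boldsymbol{\theta}^*)), \quad f(\boldsymbol{\theta}) \neq f(\boldsymbol{\theta}^*)$ (6)

for all $\gamma \ge \gamma_0$ and for all $\boldsymbol{\theta} \in \Theta^*$, then $\boldsymbol{\theta}^*$ is a global minimizer of $f$.

Proof: Let $\boldsymbol{\theta} \in \Theta^*$ where $f(\boldsymbol{\theta}) \neq f(\boldsymbol{\theta}^*)$. By the hypothesis in the lemma,
there exists $\gamma_0 > 1$ such that $\phi^\gamma(f(\boldsymbol{\theta})) \le \eta\,\phi^\gamma(f(\boldsymbol{\theta}^*))$ for all $\gamma \ge \gamma_0$. Thus
$\phi^\gamma(f(\boldsymbol{\theta}))/\phi^\gamma(f(\boldsymbol{\theta}^*)) \le \eta < 1$ for $\gamma \ge \gamma_0$, so the ratio $\phi(f(\boldsymbol{\theta}))/\phi(f(\boldsymbol{\theta}^*))$ is
itself strictly less than 1. Passing to the limit, we have $\lim_{\gamma \to \infty} \phi^\gamma(f(\boldsymbol{\theta}))/\phi^\gamma(f(\boldsymbol{\theta}^*)) = 0$.
By Proposition 1, $\boldsymbol{\theta}^*$ is a global minimizer. ■

Remark 2: When a structured magnifying effect as shown in section 3.2 is
required, Lemma 1 can be adapted according to Proposition 2 for a $\phi$ which is
strictly decreasing and convex.

For the most general problem of global optimization, the global optimal value
is unknown. While noting that it is difficult to enumerate the entire solution set
to validate (6) for all $\boldsymbol{\theta} \in \Theta^*$ assuming $\gamma_0$ exists, we shall observe in what follows
how Lemma 1 can be utilized to solve problem $(\mathcal{P})$. □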

Denote the global minimizer by $\boldsymbol{\theta}^*_g$ and let $f^* := f(\boldsymbol{\theta}^*_g)$. Suppose we have a
$\gamma$ satisfying Lemma 1. Let

$\eta\,\phi^\gamma(f^*) < \zeta \le \phi^\gamma(f^*), \quad \eta \in (0,1)$. (7)

Then, the problem of locating $\boldsymbol{\theta}^*_g$ in $(\mathcal{P})$ can be narrowed down to a search on

$(\mathcal{P}'): \quad \arg\min_{\boldsymbol{\theta}} f(\boldsymbol{\theta})$ subject to $\boldsymbol{\theta} \in L_\phi(\zeta)$, (8)

where

$L_\phi(\zeta) = \{ \boldsymbol{\theta} \mid \phi^\gamma(f(\boldsymbol{\theta})) \ge \zeta,\ \boldsymbol{\theta} \in \mathcal{D} \}$ (9)

is a level set containing $\boldsymbol{\theta}^*_g$. Here, it is required to locate a $\zeta > 0$ (which satisfies
(7)) cutting on $\phi^\gamma(f(\boldsymbol{\theta}))$ so that the global solutions are segregated from all
other local solutions. As illustrated, the task of locating such a cutting level $\zeta$
in Fig. 1(d) is easier than that in Fig. 1(a), since the range of $\zeta$ given by (7)
is larger than the corresponding range on the original function, which lies between
$f^*$ and the second lowest minimum value of $f$. Moreover, for practical reasons,
the right-hand side of (7) can be relaxed and $(\mathcal{P}')$ can be solved for the best
achievable solution closest to qualifying $\phi^\gamma(f(\boldsymbol{\theta})) \ge \zeta > \eta\,\phi^\gamma(f^*)$, $\eta \in (0,1)$, and
$\boldsymbol{\theta} \in \mathcal{D}$. We shall observe, for the following types of problems, how such a cutting
level can readily be located for the transformed function.



(i) Minimization of functions with known minimum value or its
infimum

There is a class of minimization problems where the global minimum value
is known a priori. For instance, the problem of solving $f(\boldsymbol{\theta}) = 0$ can be
reformulated as a problem of minimizing $(f(\boldsymbol{\theta}))^2$ over $\boldsymbol{\theta}$. This class of problems is,
indeed, solvable by the argument using level sets [17]. By our characterization,
global optimality is attained at $\phi^\gamma(0)$, where a cutting level at $\zeta \in (\eta\,\phi^\gamma(0), \phi^\gamma(0)]$,
$\eta \in (0, 1)$, $\gamma > 1$ can be easily chosen to segregate all other local minima for a
sufficiently high $\gamma$ value that results in a small $\eta$ value. In short, we can simply
choose a $\zeta$ that is close to $\phi^\gamma(0)$.

Instead of a known global minimum value, another class of minimization
problems may come with knowledge of only an infimum or a lower bound of
the global minimum value. This includes approximation problems using a least
squares type of error criterion where the lowest possible objective function value
approaches zero. Here, zero is a lower bound of the unknown global minimum
value. For such cases, a level set can be defined with respect to the required
solution quality. For instance, consider an approximation problem $f(\boldsymbol{\theta}) = (g^* -
g(\boldsymbol{\theta}))^2$, where the function $g \in \mathcal{R}$ of $\boldsymbol{\theta} \in \mathcal{R}^p$ is to approximate $g^* \in \mathcal{R}$ within a
certain error tolerance: $f \le \epsilon = f^* + \varepsilon$, $\varepsilon > 0$, $f^* \in (0, \epsilon)$. In this case, a cutting
level constraint can be selected with respect to the error tolerance at $\zeta = \phi^\gamma(\epsilon)$,
$\gamma > 1$, when minimizing $f$.

(ii) Minimization of functions without prior knowledge of global
minimum value

For functions with no prior knowledge of the global minimum values or the lower
bounds, the following iterative procedure can be adopted to locate a cutting level
on the transformed function segregating the global optimal solution from the
local ones.

1. Initialization: Select a $\gamma > 1$ and initialize a level offset $f_L \in \mathcal{R}$ with
respect to an arbitrary estimate $\boldsymbol{\theta}_0 \in \mathcal{D}$, i.e. set $f_L = f(\boldsymbol{\theta}_0)$. Set the initial
cutting level for the transformed function as $\zeta_0 = \phi^\gamma(0) + \Delta_0$ for some
$\Delta_0 > 0$.

2. Search: Locate a local minimizer $\boldsymbol{\theta}_1$ of $f$ satisfying $\phi^\gamma(f(\boldsymbol{\theta}) - f_L) \ge \zeta_0$ and
$\boldsymbol{\theta} \in \mathcal{D}$. If such a solution exists, set $f_L = f(\boldsymbol{\theta}_1)$, $\zeta_1 = \phi^\gamma(0) + \Delta_1$ for some
$\Delta_1 > 0$, and continue to locate the next local minimizer $\boldsymbol{\theta}_2$ of $f$ satisfying
$\phi^\gamma(f(\boldsymbol{\theta}) - f_L) \ge \zeta_1$ and $\boldsymbol{\theta} \in \mathcal{D}$, and so on.

3. Termination: The search terminates when no minimizer $\boldsymbol{\theta}_{k+1}$ can be found
for the current choice of $f_L$ equal to the local minimum value $f(\boldsymbol{\theta}_k)$ and with the
smallest possible $\Delta_k > 0$.
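The three steps above can be sketched on a one-dimensional problem. Everything concrete in this sketch is an assumption made for illustration: the test function `f` (a quadratic bowl plus a cosine, chosen to have several local minima and a unique global one at 0), the transformation $\phi(v) = e^{-v}$, the fixed offset `delta` standing in for $\Delta_k$, and the plain grid scan that stands in for the constrained search developed in section 4.

```python
import math

def f(x):
    """Hypothetical multimodal test function on [-1, 1]."""
    return x * x - math.cos(20.0 * x)

def phi_gamma(v, gamma):
    """Assumed transformation exp(-v) raised to the power gamma."""
    return math.exp(-gamma * v)

def local_descent(x, step=1e-3, lo=-1.0, hi=1.0):
    """Crude lattice descent: move to the better neighbor until stuck."""
    while True:
        neighbors = [c for c in (x - step, x + step) if lo <= c <= hi]
        best = min(neighbors, key=f)
        if f(best) < f(x):
            x = best
        else:
            return x

def level_search(x0, gamma=2.0, delta=1e-6, n_grid=2001):
    """Raise the level offset f_L after each improving local minimizer."""
    grid = [-1.0 + 2.0 * i / (n_grid - 1) for i in range(n_grid)]
    x_best = local_descent(x0)                    # step 1: initial estimate
    while True:
        f_level = f(x_best)                       # level offset f_L
        zeta = phi_gamma(0.0, gamma) + delta      # cutting level
        feasible = [x for x in grid
                    if phi_gamma(f(x) - f_level, gamma) >= zeta]
        if not feasible:                          # step 3: nothing above the cut
            return x_best
        x_best = local_descent(feasible[0])       # step 2: next local minimizer

x_found = level_search(0.9)
```

Each pass strictly lowers `f_level`, because a point can only exceed the cutting level if its function value lies below the current offset, so the loop stops once the current minimizer is the lowest on the grid.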

While leaving the implementation details to the next section, we shall illustrate
by an example the fact that the problem of locating such a cutting level becomes
'easier'¹ as the level search proceeds on a transformed function.



3.4 A global level search example 



Consider the following univariate function to be minimized: 

$f(x) = -\cos(20x), \quad -1 \le x \le 1$. (10)



Note that since the concept applies to the multivariate case as well, this example is
used without loss of generality. The plot of this function is shown in Fig. 2(a),
where we have a unique global optimal solution at $x = 0$. Suppose we do not
have prior knowledge of the global minimum value $f^* = -1$. We see that even a
small difference between two cutting levels on $f$ near the global optimal value can
result in minimizers which are far apart. Hence, if a search is performed directly
on this function, it is difficult to locate a level that just segregates the
ultimate minimum from the other minima. The situation can be worse if the function
is extremely flat near the global solution.

Now, consider the following transformation function where its monotonicity 
and convexity can be easily verified by checking its first and second derivatives: 



0 (/) = 



J 


1 


< — log 


1 + 




( 11 ) 



¹ In the sense of obtaining a clear cutting level on $\phi^{\gamma}(f(\theta))$ that segregates the global
minima from local minima.

[Figure 2 panels: (a) Original function; (b) Convex monotonic transformation;
(c) Transformed function without offset; (e) Transformed function offset by 2nd minimum;
(f) Transformed function offset by 3rd minimum.]

Fig. 2. Level search over the transformed function

The plots of $\phi$ over $f$, and of $\phi$ over $x$, are shown in Fig. 2(b) and (c), respectively,
with $f_L = 0$. We note that this particular transformation suppresses function
values which are greater than zero.

As we have seen in section 3.2, a high $\gamma$ eases the task of locating a suitable
level to segregate the global minima from local ones (hopefully within a single
step). While an exhaustive level search through all the local minima in
descending order is not expected under most circumstances, we walk through
a step-by-step level search to illustrate the convergence. Here, to illustrate only
the effect of the level offset, we have set $\gamma = 2$ for all cases in the figure.

The plots of $\phi(f(x))$ over $x$, offset at $(f - f_{L1})$ through $(f - f_{L3})$ corresponding
to the descending local minima, are shown in Fig. 2(d) through (f)
respectively. We see that as the level offset moves downwards, starting from the
highest local minimum at $f_{L1}$, the convex monotonic transformation suppresses
$f$ values which are equal to or greater than this local minimum value. At the
same time, those minima lower than the current level offset are stretched further
apart with their relative ordering preserved. When the level offset reaches the
second lowest minimum value ($f_{L3}$), all other local minima are flattened to almost
zero in the transformed function, except for the case of $\phi(f - f_{L3}) \approx 0.5$
and the case of a higher value at the transformed global minimum. Notice that the
transformed function becomes sharper as the level offset approaches the global
solution. The existence of a cutting level $\zeta$ on the transformed function becomes
more obvious as the search proceeds; hence the convergence in locating such a
cutting level.

4 Global Descent Search 

Having identified those regions that contain the global optimal solutions in the
previous section, the well-established constrained search methodologies can be applied
to locate the global solutions. Here, we adopt the penalty function method to
arrive at a simple and easily reproducible search utilizing only information
from first-order partial derivatives of the objective function. In all subsequent
developments, we use $\phi(\cdot) = e^{-(\cdot)}$ as the transformation function.



4.1 Penalty function method 

To perform a search within those regions containing the global minima, the
unconstrained minimization problem is re-formulated into a constrained one.
Here, those regions defined by a cutting level on the transformed function are
used as constraints to the minimization problem. In a more formal way, problem
$(\mathcal{P})$ in (1) is re-formulated as $(\mathcal{P}')$ given by (8) and (9), and a search is performed
on $(\mathcal{P}')$.

Consider the translated minimization problem $\min(f - \xi)$, $\xi \in \mathbb{R}$. Since
only translation is involved, the problem of locating $\theta^*$ from minimizing $f$ is
equivalent to that of minimizing $(f - \xi)$. Suppose $\xi \le f_g$; then for $s = (f - \xi)^2$
we can further treat

$$\arg\min_{\theta}\,(f(\theta) - \xi) = \arg\min_{\theta}\, s(\theta). \tag{12}$$

Thus, instead of performing a search directly on $f$, we can perform a search on
$s$ for $\theta^*$ when $\xi \le f_g$. This dependency on $\xi$ will be removed in the subsequent
derivation of a search algorithm.
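The equivalence in (12), and the way it can fail when the translation constant lies above the global minimum value, can be checked numerically. The snippet below is a small sketch under assumptions: the objective `f`, the grid, and the two constants `xi_ok` and `xi_bad` are hypothetical and chosen only for illustration.

```python
import math

def f(x):
    # Hypothetical objective used only for this check; min f = f(0) = -1.
    return x * x - math.cos(20.0 * x)

grid = [-1.0 + 0.001 * i for i in range(2001)]

def grid_argmin(fn):
    # Brute-force minimizer over the grid.
    return min(grid, key=fn)

xi_ok = -1.5    # translation constant below the global minimum value
xi_bad = 0.0    # translation constant above the global minimum value

x_f = grid_argmin(f)
x_s = grid_argmin(lambda x: (f(x) - xi_ok) ** 2)
x_bad = grid_argmin(lambda x: (f(x) - xi_bad) ** 2)
```

With `xi_ok` the squared, translated objective has the same minimizer as `f`, because squaring is monotone on non-negative values. With `xi_bad` the minimizer moves to a zero crossing of `f`, which is why the condition on the translation constant is needed.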

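Suppose there are $l$ constraint functions which specify the conditions in (9):

$$h(\theta) = [h_1(\theta), h_2(\theta), \ldots, h_l(\theta)]^T \le 0, \tag{13}$$

where $h_1(\theta)$ is the cutting-level constraint on the transformed function and
$h_2(\theta), \ldots, h_l(\theta)$ are defined by the boundaries
given by the domain of interest $\mathcal{D}$ (e.g. $h_2(\theta) = -\theta_1 - 10 \le 0$, $h_3(\theta) = \theta_1 - 20 \le 0$,
$h_4(\theta) = -\theta_2 - 5 \le 0$, and so on).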

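Let

$$\bar{h}_j(\theta) = \max\{0, h_j(\theta)\}, \quad j = 1, 2, \ldots, l \tag{14}$$

and define for $j = 1, 2, \ldots, l$,

$$\nabla \bar{h}_j(\theta) = \begin{cases} \nabla h_j(\theta) & \text{if } h_j(\theta) \ge 0 \\ 0 & \text{if } h_j(\theta) < 0. \end{cases} \tag{15}$$

Then (15) can be packed for $j = 1, 2, \ldots, l$ as

$$H^T = \nabla \bar{h}(\theta) = [\nabla \bar{h}_1(\theta), \nabla \bar{h}_2(\theta), \ldots, \nabla \bar{h}_l(\theta)] \in \mathbb{R}^{p \times l}. \tag{16}$$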

Using a quadratic penalty function [13], the above constrained minimization
problem can be written as:

$$(\mathcal{P}_c):\ \arg\min_{\theta}\, q(c, \theta) \tag{17}$$

where

$$q(c, \theta) = s(\theta) + c\,P(\theta), \tag{18}$$

$$P(\theta) = \sum_{j=1}^{l} (\bar{h}_j(\theta))^2 = \bar{h}^T(\theta)\,\bar{h}(\theta), \tag{19}$$

and $c > 0$ is a large penalty coefficient. $\bar{h}_j(\theta)$ is defined in (14).

In [13], it has been shown that as $c \to \infty$, the solution to the minimization
problem $(\mathcal{P}_c)$ given by (17) will converge to a solution of the original constrained
problem $(\mathcal{P}')$ given by (8). Here, we shall consider the problem of finding the
solution of $(\mathcal{P}_c)$ (17)-(19) in the sequel.
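As a concrete reading of (14), (18) and (19), the quadratic penalty can be written in a few lines. This is a sketch only; the function names and the use of the two example bounds on the first coordinate quoted above are illustrative assumptions.

```python
def penalty(h_values):
    # P(theta) = sum_j max(0, h_j(theta))^2, following (14) and (19).
    return sum(max(0.0, h) ** 2 for h in h_values)

def penalized_objective(s_value, h_values, c):
    # q(c, theta) = s(theta) + c * P(theta), following (18).
    return s_value + c * penalty(h_values)

def box_constraints(theta1):
    # The two example bounds quoted in the text for the first coordinate:
    # -theta1 - 10 <= 0 and theta1 - 20 <= 0.
    return [-theta1 - 10.0, theta1 - 20.0]
```

Inside the box the penalty vanishes and `penalized_objective` returns `s_value` unchanged; outside it, the violation is squared and scaled by `c`, so a larger `c` pushes the minimizer of the penalized objective back toward the feasible region.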

Let $F^T = \nabla f$. For the unconstrained objective function given by $s(\theta)$ and the
penalty function given by $P(\theta)$ in (19), we note that their first- and second-order
partial derivatives are written as:

$$\nabla s(\theta) = 2F^T(f - \xi), \qquad \nabla^2 s(\theta) = 2(F^TF + (f - \xi)\nabla^2 f) = 2(F^TF + R_s),$$

$$\nabla P(\theta) = 2H^T\bar{h}, \qquad \nabla^2 P(\theta) = 2(H^TH + R_h).$$

Also, the functional dependency on $\theta$ is omitted (here and in the subsequent
derivations) when clarity is not affected. The following descent algorithm is derived
to search for the global minima.

4.2 A global descent algorithm 

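It can now be assumed further that the first partial derivatives of $P(\theta)$ are
continuous, including at those points where $h_j(\theta) = 0$, $j = 1, 2, \ldots, l$, following [13].
Taking the quadratic approximation of $q(c, \theta)$ (17) about $\theta_0$, we have

$$q(c, \theta) \approx s(\theta_0) + (\theta - \theta_0)^T \nabla s(\theta_0) + \tfrac{1}{2}(\theta - \theta_0)^T \nabla^2 s(\theta_0)(\theta - \theta_0)$$
$$\qquad + c\left[P(\theta_0) + (\theta - \theta_0)^T \nabla P(\theta_0) + \tfrac{1}{2}(\theta - \theta_0)^T \nabla^2 P(\theta_0)(\theta - \theta_0)\right].$$

The first-order necessary condition for optimality requires that

$$\nabla s(\theta_0) + \nabla^2 s(\theta_0)(\theta - \theta_0) + c\left[\nabla P(\theta_0) + \nabla^2 P(\theta_0)(\theta - \theta_0)\right] = 0, \tag{20}$$

which can also be written as

$$\theta = \theta_0 - \left[\nabla^2 s(\theta_0) + c\nabla^2 P(\theta_0)\right]^{-1}\left(\nabla s(\theta_0) + c\nabla P(\theta_0)\right)$$
$$\quad = \theta_0 - \left[F^TF + R_s + cH^TH + cR_h\right]^{-1}\left(F^T(f - \xi) + cH^T\bar{h}\right).$$

If we drop the second-order derivative terms $R_s$ (which is unknown) and $R_h$,
we can formulate a search algorithm as follows:

$$\theta_{i+1} = \theta_i - \left[F^TF + cH^TH\right]^{-1}\left(F^T(f - \xi) + cH^T\bar{h}\right). \tag{21}$$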

Since $f$ is always greater than $\xi$ for all $\xi \le f_g$, the term $(f - \xi)$ can be replaced
by a unit positive term. The required magnitude of search is then taken care of
by a line search procedure. Hence, no specific knowledge about $\xi$ (and hence
$f^*$) is needed. To prevent ill-conditioning of the search when $F^TF + cH^TH$ is
singular, the above algorithm can be further modified according to
Levenberg-Marquardt’s method [12, 14] as follows:

$$(\mathcal{A}):\quad \theta_{i+1} = \theta_i - \beta\left[F^TF + cH^TH + \alpha I\right]^{-1}\left(F^T + cH^T\bar{h}\right) \tag{22}$$

where $0 < (\alpha, \beta) \le 1$ and $I$ corresponds to a $(p \times p)$ identity matrix. In this
application, a simple line search sequence is constructed as: $\beta_{k+1} = \beta_k/2$, $k =
0, 1, 2, \ldots$.
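One iteration of the update (22) can be sketched with plain Python lists. This is an illustrative reading under assumptions, not the code used for the experiments: the helper names are hypothetical, `H` is stored with one row per constraint gradient (a zero row when the constraint is inactive), and the rule for accepting a halved step, a strict decrease of a user-supplied merit function, is an assumption because the acceptance test is not spelled out above.

```python
def solve(A, b):
    # Gaussian elimination with partial pivoting for a small dense system A x = b.
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for k in range(n):
        piv = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[piv] = M[piv], M[k]
        for r in range(k + 1, n):
            m = M[r][k] / M[k][k]
            for col in range(k, n + 1):
                M[r][col] -= m * M[k][col]
    x = [0.0] * n
    for k in range(n - 1, -1, -1):
        x[k] = (M[k][n] - sum(M[k][j] * x[j] for j in range(k + 1, n))) / M[k][k]
    return x

def search_direction(grad_f, h_bar, H, c, alpha):
    # d = (F^T F + c H^T H + alpha I)^{-1} (F^T + c H^T h_bar), as in (22).
    p = len(grad_f)
    A = [[grad_f[i] * grad_f[j]
          + c * sum(row[i] * row[j] for row in H)
          + (alpha if i == j else 0.0)
          for j in range(p)] for i in range(p)]
    g = [grad_f[i] + c * sum(row[i] * hb for row, hb in zip(H, h_bar))
         for i in range(p)]
    return solve(A, g)

def halving_line_search(merit, theta, direction, beta0=1.0, max_halvings=30):
    # Try beta_0, beta_0 / 2, beta_0 / 4, ... and accept the first step that
    # lowers the merit function (assumed acceptance rule).
    base = merit(theta)
    beta = beta0
    for _ in range(max_halvings):
        trial = [t - beta * d for t, d in zip(theta, direction)]
        if merit(trial) < base:
            return trial, beta
        beta /= 2.0
    return theta, 0.0
```

The regularizing term `alpha` keeps the system solvable when the gradient outer product and the constraint term together are rank deficient, which is the situation the Levenberg-Marquardt modification is meant to handle.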

Remark 3:- Since the local maxima of $\phi$ are ‘flattened’ by raising $\phi$ to a high
power, the resultant constraint function characterizing the global optimality may
be ill-conditioned, i.e. its derivative in regions other than the neighborhood of the global
optima may also be too ‘flat’ to be useful. Hence, implementation of algorithms
utilizing derivatives of the constraint function requires careful selection of $\gamma$ such
that the global optima are segregated and, at the same time, the constraint function
does not become ill-conditioned. For the case of $\phi(\cdot) = e^{-(\cdot)}$, introduction
of $f_L$ into the transformed functional constraint (i.e. $\phi^{\gamma}(f(\theta) - f_L) \ge \zeta$) can
help in achieving a good scaling effect since $\phi^{\gamma}(f - f_L) = e^{\gamma f_L}e^{-\gamma f}$.
Moreover, the penalty constant $c$ provides an additional scaling means for the
constrained direction $H^TH$. We shall show in our numerical examples some
practical choices of $\gamma$, $c$, $f_L$ and $\rho$. □

5 Benchmark Experiments 

To illustrate the effectiveness of a search utilizing regions characterized by the
proposed monotonic transformation, as compared to one that does not use the
transformation, we conduct numerical experiments on algorithm $(\mathcal{A})$, adopting the
level cutting methods of section 3.3(i) and (ii) (abbreviated as GTL(i)
and GTL(ii) respectively), and on an algorithm (GOL) which uses a direct cutting
level on the original function to define feasible regions containing the desired
solution (i.e. $L_f(\epsilon) = \{\theta \mid f(\theta) \le \epsilon,\ \theta \in \mathcal{D}\}$). Several benchmark problems (see
Table 1 and [17]), which are considered useful for evaluating different aspects
of nonlinearity, are used.

The performance criteria compared include the number of gradient evaluations
(equal to the iteration number $i$ in the algorithm) and the number of
function evaluations required to arrive at the global minimum within a certain
level of accuracy. All examples use the corners defining the domain of interest
as initial estimates, following comparative results from the literature. To give
an idea of the physical computing speed, the CPU time taken in standard
units (the real time for optimization divided by the real time for 1000 evaluations
of Shekel-5 at (4, 4, 4, 4)) for each search process is also recorded. Noting that
the error tolerances as seen in [2] and [1] were set to 1% and 10 respectively,
we consider an accuracy of 0.1% with respect to the known global minimum value
suitable for algorithm termination in these applications.

For GTL(i), since the global minimum function values (/*) are assumed to 
be known, we set the scaling parameter p = with = /* throughout all 
experiments. As for the case of unknown global minimum value, the procedure 
in section 3.3(ii) is adopted for GTL(ii) starting with Jl = f{do) ~ (^0 is 

the initial estimate) and lowering this $f_L$ value with level step-size $\delta f_L$ whenever
$f(\theta_i) \le f_L$ as the search proceeds.

For comparisons to be consistent, similar parameter settings are maintained
for all GOL, GTL(i) and GTL(ii) search procedures (c = 10® and a = 0.0001). 
The cutting level constraint on the non-transformed function for GOL is set
according to the desired accuracy ($\epsilon = 0.1\%$ in $L_f(\epsilon)$). Other parameters for
the GTLs are chosen as: $\gamma = 4$ for both GTL(i) and GTL(ii); $\zeta = 0.9999$ for GTL(i)
and $\zeta = 1.04$ for GTL(ii) throughout, except for the choices of level step-size $\delta f_L$,
which are shown in the legend of Table 1. Here, we note that a minimum value
of $f_L$ has been set for S5, S7 and S10 at −1.3, −1.4 and −1.3 respectively.

On top of the constraint imposed by global characterization, all search proce- 
dures are performed with additional constraints which are defined by the bound- 
ing box specifying the domain of interest. To make the global constraint function 
more dominating, a scaling factor of 100 was used. 

5.1 Comparative results 

For all functions, as in other comparative reports (e.g. [1]), every corner of the
bounding box defining the domain of interest was chosen as an initial estimate for
the iterative search. The average numbers of function evaluations for those runs
that achieved the desired solutions are summarized in Table 1, together with the
best known reported results as seen in [1, 2, 9]. The average CPU times for each
function in standard units are presented in Table 2. From Table 1, except for the GP,
BR and H3 problems, significant improvements are seen for the GTLs over GOL in
terms of convergence of the search to the desired solutions. While the GTL(i) and
GTL(ii) searches remain simple in design, we note that their speed of
convergence, in terms of the number of function evaluations required and the standard
CPU time taken, is comparable to, if not better than, that of the fastest global search
algorithms reported. Also, we note that the number of function evaluations
required for GTL(i) and GTL(ii) is apparently independent of the dimension of the
problems.

Remark 4:- The main reason for the fast convergence in the above applications is that
there is no re-start along the search so long as the choice of $f_L$ or $\delta f_L$ provides
good scaling of the constraint function. This is different from most existing
designs of global search engines. Effectively, the constrained algorithm, when well
implemented, should lead to global optimal solutions with high probability.
However, in its present form, we note that the algorithm cannot guarantee to attain
the global solutions. Nevertheless, the proposal remains a simple and
scientifically reproducible means to be further explored. We shall leave those issues
related to good implementation of the search algorithm for our future work. In
the following, we show that this simple search can improve local search in
artificial neural network learning problems. □

6 FNN Application: Parity Patterns Learning 

Pattern learning using artificial neural networks represents an important
application area in Computer Vision and Pattern Recognition (see e.g. [11]).

Table 1. Best known results in terms of number of function evaluations required by different methods

                                     Test function
Method     GP      BR     CA      RA     SH       H3     H6     S5      S7      S10
SDE        15439   2700   10822   -      241215   3416   -      -       -       -
EA         460     430    -       2048   -        -      -      -       -       -
MLSL       148     206    -       -      -        197    487    404     432*    564
IA         -       1354   326     -      7424     -      -      -       -       -
TUN        -       -      1469    -      12160    -      -      -       -       -
TS         486     492    -       540    727      508    2845   -       -       -
ACF        394     20     52      158    -        78     249    -       -       -
TRUST      103     55     31      59     72       58     -      -       -       -
GOL        385     43     166*    **     187*     179    743*   1146*   1011*   775*
GTL(i)     **      57     44      303    239*     213    787    429*    58*     109*
GTL(ii)    222     89     16      49     313      227    63     127     60      71

* Unable to locate the desired solution from some initial points.

** Unable to locate the desired solution from all stated initial points.

Note: SDE is the stochastic method of Aluffi-Pentini et al.; EA is the evolution algorithms of
Yong et al. or Schneider et al.; MLSL is the multilevel single-linkage method of Kan and Timmer;
IA is the interval arithmetic technique of Ratschek and Rokne; TUN is the tunneling method of
Levy and Montalvo; TS is the Taboo search of Cvijovic and TRUST is the method of Barhen
et al. (see [1,9] for detailed references and results). Choices of $\delta f_L$ in GTL(ii) are: GP(2.55),
BR(50), CA(10), RA(0.85), SH(0.85), H3(1.15), H6(3.9), S5(0.5), S7(2), S10(0.5).



Table 2. Average CPU time (unit: 1000 S5 evaluations)

                                  Test function
Method     GP     BR     CA      RA     SH      H3     H6      S5      S7      S10
EA         -      1.50   -
MLSL       -      0.25   -       -      -       0.50   2.00    1.00    1.00*   2.00
GOL        0.86   0.10   0.55*   **     0.28*   0.73   2.56*   3.23*   3.05*   2.48*
GTL(i)     **     0.14   0.10    0.44   0.64*   0.61   2.42    1.08*   0.21*   0.38*
GTL(ii)    0.31   0.14   0.03    0.07   0.75    0.50   0.16    0.39    0.25    0.27

* Unable to locate the desired solution from some initial points.

** Unable to locate the desired solution from all stated initial points.



In this section, we apply the penalty-based algorithm to a class of neural
network learning problems which are much more complex than those benchmark
problems. Due to the complexity of physical data for approximation and the
interconnected nonlinear activation units, the learning problem of an artificial neural
network represents a highly nonlinear minimization problem with a large number
of minimizing parameters (network inter-connection weights) to be solved.

Consider a network function $y(x, w) \in \mathbb{R}$ (see [16] for details), where $x \in \mathbb{R}^m$ and
$w \in \mathbb{R}^p$ represent the $m$-dimensional network input vector and the $p$-dimensional
network weight vector respectively. A commonly adopted learning objective is
to fit certain target data by minimizing the $\ell_2$-error norm given by:

$$s(w) = \sum_{i=1}^{n} \left(y^t_i - y(x_i, w)\right)^2, \tag{23}$$

where $y^t_i$, $i = 1, 2, \ldots, n$, is the sample target vector with $n$ data elements. It has
been shown by several researchers (see e.g. [4,3,5]) that multilayer neural
networks, under certain conditions, can approximate any nonlinear function to any
desired accuracy using different techniques such as Kolmogorov’s functional
superposition theorem [10], the Stone-Weierstrass theorem (see e.g. [15]) and the projection
pursuit method [8]. Hence, this minimization objective belongs to the class of
problems with a known lower bound for the global minimum value, when a proper
network size is selected.

A strictly two-layer feedforward neural network (FNN) is used to learn several
parity functions, which are essentially the generalized XOR pattern recognition
problem. The output of the network is set to ‘1’ if an odd number of inputs are
ones, and ‘0’ otherwise. Since the output changes whenever any single input unit
changes, parity problems are considered challenging for evaluating network
learning performance. Our aim here is to demonstrate a possible use of the
penalty-based algorithm for problems with a relatively large number of parameters. As
comparable global search results are not available for these examples, we shall
compare the global descent algorithm (GTL(i) with $f_L = f^* = 0$, $\rho = 10e^{\gamma f_L}$
and $\zeta = 9.9999$; hereon denoted as GTL) with its local line search counterpart
(LS) in terms of the speed of convergence and the percentage of trials reaching the
global solution. Also, we shall observe the effect of increasing complexity of the
target function on the convergence of the penalty-based algorithm.

The number of hidden-layer neurons ($N_j$) has been chosen to be the same
as the number of input units in each case. Table 3 shows the algorithm settings
for the parity problems considered. Here, we note that $p$ is the dimension of
the minimizing parameter space. Our largest problem to be solved has $p = 81$.
The training error goal (Err, which is the sum of squared errors, as shown in the
table) was set at 0.025 for all cases. The settings for the penalty-based search
parameters were chosen to be similar for most cases as far as nonsingularity of
a normalization matrix for the constraint function was maintained.
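For concreteness, the parity training sets and the size of the parameter space can be generated as below. This is a small sketch; the function names are hypothetical, and the weight count assumes one bias per hidden unit and one bias at the output, which reproduces the values of p listed for the four problems.

```python
from itertools import product

def parity_patterns(n_bits):
    # All 2^n binary input patterns, with target 1 when the number of ones
    # is odd and 0 otherwise.
    inputs = [list(bits) for bits in product((0, 1), repeat=n_bits)]
    targets = [sum(bits) % 2 for bits in inputs]
    return inputs, targets

def weight_count(n_inputs, n_hidden):
    # (n_hidden + 1) output weights plus (n_inputs + 1) weights per hidden
    # unit, biases included.
    return (n_hidden + 1) + (n_inputs + 1) * n_hidden
```

For parity-3 this gives eight patterns, half of them with target 1, and `weight_count(3, 3)` evaluates to 16.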

The experiment was carried out with 100 trials for each parity problem using
random initial estimates from $[0, 1]^p$. The results for local search and global
search are presented in Table 4 (LS: local line search, GTL: global method). To
exclude extreme values which can distort the dominant shape of the distribution,
the statistics (mean, standard deviation) are obtained after discarding those
trials with a number of iterations exceeding the maximum value indicated. To reflect
the physical computation speed with respect to an IBM-compatible Pentium (266 MHz)
machine, the average CPU time for 10 iterations is also recorded
in the table.

Table 3. Settings for parity-3,4,6,8 functions

Fn    N_j   p     α        γ    c      ρ     ζ        Err Goal
P3    3     16    0.0001   4    10     10    9.9999   0.025
P4    4     25    0.0001   4    10     10    9.9999   0.025
P6    6     49    0.0001   4    1000   10    9.9999   0.025
P8    8     81    0.0001   2    1000   10    9.9999   0.025



As seen from Table 4, the GTL algorithm, as compared to the local LS
method, shows excellent improvement in terms of the number of trials (GOP)
attaining the desired error goal for all cases. For the parity-3 and the parity-4
problems, the mean number of iterations is seen to be much lower for the GTL
algorithm than for the LS algorithm. The slow convergence for the
parity-6 problem and the parity-8 problem is a result of the high penalty setting,
which contributes to a much ‘flattened’ penalized error surface. Due to the significant
increment in the dimension of the weight space, the CPU times for the GTL algorithm
show a significant increment for high parity problems as compared to those of the LS
algorithm.

Table 4. Local and global search results for parity-3,4,6,8 functions

LS
              CPU       Number of iterations
Case   Fn     (sec)     min    max       mean     std dev.   GOP (%)
(a)    P3     0.67      9      > 500     66.68    98.57      53
(b)    P4     0.84      12     > 500     70.14    96.28      49
(c)    P6     1.66      22     > 1000    67.89    45.17      36
(d)    P8     7.30      37     > 1000    256.12   234.42     17

GTL
              CPU       Number of iterations
Case   Fn     (sec)     min    max       mean     std dev.   GOP (%)
(e)    P3     0.79      10     > 500     20.81    10.03      99
(f)    P4     1.05      21     145       38.46    20.01      100
(g)    P6     2.31      144    > 1000    399.33   170.29     90
(h)    P8     11.14     147    > 1000    329.24   100.83     72

In this FNN example, we have shown the effectiveness of the transformation-based
global descent design, obtained with a minor increase in computational
cost and design effort as compared to many existing global optimization
algorithms, though there remains room for improvement to guarantee attaining
the global optimal solution. The reader is referred to [16] for more application
examples.



7 Conclusion And Future Work 

Inspired by a visual warping effect, we provided a useful observation on global
optimality of functions under monotonic transformation. Based on this
observation, a global descent search procedure is designed to minimize the energy
functions arising from applications. Compared with an algorithm that does not
use this transformation, we have shown that our approach can improve
numerical sensitivity, and hence numerical convergence, for several penalty-based
search applications.

When a suitable level step size is selected, the simple penalty-based
constrained search algorithm is shown to be comparable to the best known global
optimization algorithms in terms of the number of function evaluations used and the
standard CPU time taken to reach the region of the global optimum on
several benchmark problems. The global descent search has also been shown
to be applicable to a class of neural network pattern learning problems, with
significant improvement over the local line search method in the probability of
reaching a neighborhood of the global optimal solution.

Several areas which have been identified for our future investigation include
improvement of the constrained minimization search and application to
complex computer vision and pattern recognition problems such as support vector
machine training and face recognition.

Abstract. This paper addresses the issues of global optimality and
training of a Feedforward Neural Network (FNN) error function
incorporating the weight decay regularizer. A network with a single hidden
layer and a single output unit is considered. Explicit vector and matrix
canonical forms for the Jacobian and Hessian of the network are
presented. Convexity analysis is then performed utilizing the known
canonical structure of the Hessian. Next, global optimality characterization
of the FNN error function is attempted utilizing the results of convex
characterization and a convex monotonic transformation. Based on this
global optimality characterization, an iterative algorithm is proposed for
global FNN learning. Numerical experiments with benchmark examples
show better convergence of our network learning as compared to many
existing methods in the literature. The network is also shown to
generalize well for a face recognition problem.



1 Introduction 

Backpropagation of error gradients has proven to be useful in layered feedforward
neural network learning. However, a large number of iterations is usually needed
for adapting the weights. The problem becomes more severe when a
high level of accuracy is required. Between the late 1980s and the early 1990s,
deriving fast training algorithms to circumvent the slow training rate of the
error backpropagation algorithm was an active area of research.
Among the various methods proposed (see e.g. [6, 14]), significant improvement
to the training speed is seen through the application of nonlinear optimization
techniques in network training (see e.g. [1,2,22]). Very often, this is achieved
at the expense of heavier computational requirements.¹ Here, we note that most
of them are local methods and training results are very much dependent on the
choices of initial estimates.

In light of efficient training algorithm development and network pruning (see 
[5], page 150-151), exact calculation of the second derivatives (Hessian) and its 

^ For example, the complexity of each step in Newton’s method is 0{n^) as compared 
to most first-order methods which are 0{n). 
multiplied forms were studied [4,16]. In [4], it was shown that the elements of
the Hessian matrix can be evaluated exactly using multiple forward propagation 
through the network, followed by multiple backward propagation, for a feed- 
forward network without intra-layer connections. In [16], a product form of the 
Hessian which took about as much computation as a gradient evaluation was pre- 
sented. The result was then applied to a one pass gradient algorithm, a relaxation 
gradient algorithm and two stochastic gradient calculation algorithms to train 
the network. From the review in [7] specifically on the computation of second 
derivatives of feedforward networks, no explicit expression for the eigenvalues 
was seen. Here we note that eigenvalues are important for convexity analysis. 

In view of the lack of an optimality criterion for global network learning in 
classification and regression applications, we shall look into the following issues in 
this paper: (i) characterization of global optimality of a FNN learning objective 
incorporating the weight decay regularizer, and (ii) derivation of an efficient 
search algorithm based on results of (i). We shall provide extensive numerical 
studies to support our claims. 

The paper is organized as follows. In section 2, the layered feedforward neural 
network is introduced. This is followed by explicit Jacobian and Hessian formu- 
lations for forthcoming analysis. The FNN learning error function is then regu- 
larized using the weight decay method for good generalization capability. Then 
convexity analysis is performed in section 3 for local solution set characterization. 
This paves the way for a new approach on global optimality characterization in 
section 4. In section 5, the analysis results are used to derive a search algo- 
rithm to locate the global minima. Several benchmark examples are compared 
in section 6 in terms of convergence properties. The learning algorithm is also 
applied to a face recognition problem with good convergence as well as good 
generalization. Finally, some concluding remarks are drawn. 

Unless otherwise stated, vector and matrix quantities are denoted using bold
lowercase characters and bold uppercase characters respectively to distinguish
them from scalar quantities. The superscript ‘T’ on the respective character is
used to denote matrix transposition. Also, if not otherwise stated, $\|\cdot\|$ is taken
to be the $\ell_2$-norm.



2 Multilayer Feedforward Neural Network 



2.1 Neural feedforward computation 



The neural network being considered is the familiar multilayer feedforward
network with no recurrent or intra-layer connections. The forward calculation of
a strictly 2-layer network with one output-node, i.e. a network with one direct
input-layer, one hidden-layer, and one output-layer consisting of a single node (i.e.
a network with an ($N_i$-$N_j$-1) structure), can be written as:

$$y(x, w) = g\left(\sum_{j=0}^{N_j} w^o_{kj}\, g\left(\sum_{i=0}^{N_i} w^h_{ji} x_i\right)\right), \quad g(z^h_0) = g(w^h_{0i}x_i) = 1,\ x_0 = 1,\ k = 1, \tag{1}$$

where $x = \{x_i, i = 1, 2, \ldots, N_i\}$ denotes the network input vector and $w =
\{[w^o_{kj}]_{j=0,\ldots,N_j;\,k=1}, [w^h_{ji}]_{i=0,\ldots,N_i;\,j=1,\ldots,N_j}\}$ denotes the weight parameters to be
adjusted. $g(\cdot)$ is a sigmoidal function given by



$$g(\cdot) = \frac{1}{1 + e^{-(\cdot)}}. \tag{2}$$

The superscripts ‘$o$’ and ‘$h$’ denote weights that are connected to output-nodes
and hidden-nodes respectively.

For $n$ data points, input $x$ becomes an $n$-tuple vector (i.e. $x \in
\mathbb{R}^{(N_i+1) \times n}$, for $N_i$ input-nodes plus a constant bias term) and so is
the output $y$ (i.e. $y \in \mathbb{R}^n$).
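The forward computation of (1) and (2) for a single input pattern can be sketched as follows. The function names and the layout of the weights (bias first in every weight list) are assumptions made for this illustration.

```python
import math

def sigmoid(z):
    # Sigmoidal activation of (2).
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_out, w_hid):
    # x     : list of N_i input values
    # w_hid : N_j lists of N_i + 1 hidden weights, bias weight first
    # w_out : list of N_j + 1 output weights, bias weight first
    u = [1.0] + list(x)                      # x_0 = 1 carries the bias
    hidden = [1.0] + [sigmoid(sum(w * ui for w, ui in zip(row, u)))
                      for row in w_hid]      # g(z_0^h) = 1 carries the output bias
    z_out = sum(w * h for w, h in zip(w_out, hidden))
    return sigmoid(z_out)
```

With all weights set to zero every hidden unit outputs 0.5 and the network output is 0.5, which gives a quick sanity check of the wiring.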



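2.2 Explicit Jacobian and Hessian for a two-layer network with
single output

Here, we stack the network weights as a parameter vector $w \in \mathbb{R}^p$ ($p = (N_j +
1) + (N_i + 1)N_j$) as follows:

$$w = [w^o_{k,0}, \ldots, w^o_{k,N_j}, w^h_{1,0}, \ldots, w^h_{1,N_i}, \ldots, w^h_{N_j,0}, \ldots, w^h_{N_j,N_i}]^T. \tag{3}$$

Considering a single data point, the first derivative of the network output $y(x, w)$ (1)
with respect to the weight vector can be written as

$$J(w) = \nabla^T y(x, w) = \dot{g}(z^o_k)\, r^T, \tag{4}$$

where

$$r = [1, g(z^h_1), \ldots, g(z^h_{N_j}), \dot{g}(z^h_1) w^o_{k,1} u^T, \ldots, \dot{g}(z^h_{N_j}) w^o_{k,N_j} u^T]^T \in \mathbb{R}^p, \tag{5}$$

$$u = [1, x_1, \ldots, x_{N_i}]^T \in \mathbb{R}^{N_i+1}, \tag{6}$$

$$\dot{g}(\cdot) = \frac{e^{-(\cdot)}}{(1 + e^{-(\cdot)})^2}, \tag{7}$$

$$z^o_k = \sum_{j=0}^{N_j} w^o_{kj}\, g(z^h_j), \tag{8}$$

$$z^h_j = \sum_{i=0}^{N_i} w^h_{ji} x_i, \quad g(z^h_0) = g(w^h_{0i}x_i) = 1,\ x_0 = 1,\ j = 1, \ldots, N_j. \tag{9}$$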


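The second derivative, which is also termed the Hessian, for the network
function $y(x, w)$ (1) is then

$$Y = \nabla^2 y(x, w) = \ddot{g}(z^o_k)\,P + \dot{g}(z^o_k)\,Q, \tag{10}$$

where

$$\ddot{g}(\cdot) = \left(\frac{2e^{-(\cdot)}}{1 + e^{-(\cdot)}} - 1\right)\frac{e^{-(\cdot)}}{(1 + e^{-(\cdot)})^2}, \tag{11}$$

$$P = rr^T, \tag{12}$$

$$Q = \begin{bmatrix}
0 & 0 & \cdots & 0 & 0 & \cdots & 0 \\
0 & 0 & \cdots & 0 & \dot{g}(z^h_1)u^T & \cdots & 0 \\
\vdots & \vdots & & \vdots & & \ddots & \\
0 & 0 & \cdots & 0 & 0 & \cdots & \dot{g}(z^h_{N_j})u^T \\
0 & \dot{g}(z^h_1)u & \cdots & 0 & \ddot{g}(z^h_1)w^o_{k,1}uu^T & \cdots & 0 \\
\vdots & & \ddots & & & \ddots & \\
0 & 0 & \cdots & \dot{g}(z^h_{N_j})u & 0 & \cdots & \ddot{g}(z^h_{N_j})w^o_{k,N_j}uu^T
\end{bmatrix}. \tag{13}$$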

2.3 Learning and generalization 

Since the goal of network training is not just to learn the given sample training
data, here we adopt the weight decay (see e.g. [5,9]) regularization method to
provide some degree of network generalization.

Consider a target learning vector given by $y^t_i$, $i = 1, \ldots, n$; the following
learning objective for the FNN $y(x_i, w)$ is considered:

$$s(w) = s_1(w) + b\,s_2(w), \quad b \ge 0, \tag{14}$$

where

$$s_1(w) = \sum_{i=1}^{n} \left(y^t_i - y(x_i, w)\right)^2 = \sum_{i=1}^{n} \epsilon_i^2 = \epsilon^T(w)\,\epsilon(w), \tag{15}$$

and

$$s_2(w) = \|w\|_2^2 = w^Tw. \tag{16}$$

In what follows, we shall retain $s_1$ as the minimization objective for regression
applications (i.e. by setting $b = 0$ for $s$ in (14)) since only fitting accuracy is
required. We shall observe how the learning objective in (14) can influence
generalization of FNN learning for classification in a face recognition problem.
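The regularized objective (14)-(16) can be evaluated directly once the network outputs are available. The function below is a minimal sketch; its name and argument layout are assumptions for illustration.

```python
def weight_decay_objective(targets, outputs, weights, b):
    # s(w) = s_1(w) + b * s_2(w), following (14)-(16).
    s1 = sum((t - y) ** 2 for t, y in zip(targets, outputs))  # sum of squared errors
    s2 = sum(w * w for w in weights)                          # squared weight norm
    return s1 + b * s2
```

Setting `b = 0.0` recovers the plain sum-of-squares fitting objective used for regression, while a positive `b` trades fitting accuracy against the size of the weights.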




3 Convexity Analysis 

In this section, we shall analyze the convexity of the more general $s$ such that
local optimal solutions can be found. The convexity of $s_1$ can be directly obtained
by setting $b = 0$ from that of $s$. According to ([17], Theorem 4.5), convexity of
a twice continuously differentiable function is conditioned by the positive
semi-definiteness of its Hessian matrix.

To determine explicit conditions for our application, consider the $\ell_2$-norm
training objective given $n$ training data in (14). The Hessian of $s(w)$ can be
written as

$$\nabla^2 s(w) = 2\left[E^T(w)E(w) + R(w) + bI\right], \tag{17}$$

where $E(w)$ is the Jacobian of $\epsilon(w)$ and

$$E^T(w)E(w) = (-J)^T(w)(-J)(w) = J^T(w)J(w).$$

The positive semi-definiteness of $\nabla^2 s(w)$ is thus dependent on the matrices
$bI$ and $R(w)$. By exploring the structure of the above Hessian, we
present our convexity result as follows. We shall prove two lemmas before we
proceed further:



Lemma 1. Consider $v = [v_1, v_2, \dots, v_n]^T \in \mathbb{R}^n$ for some $v_i \ne 0$, $i = 1, 2, \dots, n$.
Then the eigenvalues of $vv^T$ are given by

$\lambda(vv^T) = \left\{\sum_{i=1}^{n} v_i^2,\ 0, \dots, 0\right\}$.   (19)

Proof: Since $vv^T$ is symmetric with rank one, it has one and only one real
non-zero eigenvalue, which equals the trace of $vv^T$. This completes the proof. ■
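Lemma 1 is easy to confirm numerically. The sketch below, with an arbitrarily chosen vector, checks that $v$ is an eigenvector of $vv^T$ with eigenvalue $\mathrm{tr}(vv^T) = \sum_i v_i^2$ and that a vector orthogonal to $v$ is mapped to zero. The helper name is illustrative only.

```python
def outer_times(v, x):
    """Return (v v^T) x without forming the matrix, i.e. v * (v . x)."""
    dot = sum(vi * xi for vi, xi in zip(v, x))
    return [vi * dot for vi in v]

v = [1.0, -2.0, 3.0]
trace = sum(vi * vi for vi in v)           # tr(v v^T) = 1 + 4 + 9 = 14

# v itself is an eigenvector with eigenvalue tr(v v^T).
print(outer_times(v, v), [trace * vi for vi in v])

# Any vector orthogonal to v is mapped to zero (eigenvalue 0).
x_orth = [2.0, 1.0, 0.0]                   # v . x_orth = 2 - 2 + 0 = 0
print(outer_times(v, x_orth))
```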



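Lemma 2. Consider $v = [v_1, v_2, \dots, v_n]^T \in \mathbb{R}^n$ for some $v_i \ne 0$, $i = 1, 2, \dots, n$, and

$A = \begin{bmatrix}
0 & 0 & \cdots & 0 & 0 & \cdots & 0 \\
0 & 0 & \cdots & 0 & a_1 v^T & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 0 & 0 & \cdots & a_{n-1} v^T \\
0 & a_1 v & \cdots & 0 & b_1 vv^T & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & a_{n-1} v & 0 & \cdots & b_{n-1} vv^T
\end{bmatrix} \in \mathbb{R}^{n^2 \times n^2}$.   (20)

Then the eigenvalues of $A$ are given by $\lambda(A) = \{\lambda_1, \dots, \lambda_{2(n-1)}, 0, \dots, 0\}$, where

$\lambda_j = \tfrac{1}{2}\left[\, b_j V \pm \sqrt{(b_j V)^2 + 4\,(a_j^2 V)}\,\right], \quad j = 1, 2, \dots, n-1, \quad \text{and} \quad V = \sum_{i=1}^{n} v_i^2$.

Proof: Solving for the eigenvalues block by block yields the above result. ■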
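The block-by-block argument can be checked on a single block $\begin{bmatrix} 0 & a v^T \\ a v & b\,vv^T \end{bmatrix}$. Trying an eigenvector of the form $[\alpha;\ v]$ gives $\lambda\alpha = aV$ and $\lambda = a\alpha + bV$, hence $\lambda^2 - bV\lambda - a^2V = 0$, which is the formula of Lemma 2. The sketch below verifies this numerically for arbitrary illustrative values of $a$, $b$ and $v$ (none of them taken from the paper).

```python
import math

def block_matvec(a, b, v, alpha, x):
    """Apply the (1+n) x (1+n) block [[0, a v^T], [a v, b v v^T]] to [alpha; x]."""
    vx = sum(vi * xi for vi, xi in zip(v, x))
    top = a * vx
    bottom = [a * alpha * vi + b * vi * vx for vi in v]
    return top, bottom

def lemma2_eigenvalues(a, b, v):
    """The two non-zero eigenvalues predicted by Lemma 2 for one block."""
    V = sum(vi * vi for vi in v)
    root = math.sqrt((b * V) ** 2 + 4.0 * a * a * V)
    return 0.5 * (b * V + root), 0.5 * (b * V - root)

def eigen_residual(a, b, v, lam):
    """Largest entry of |A x - lam x| for the trial eigenvector x = [alpha; v]."""
    V = sum(vi * vi for vi in v)
    alpha = a * V / lam                    # from the first block row
    top, bottom = block_matvec(a, b, v, alpha, v)
    return max([abs(top - lam * alpha)] +
               [abs(bi - lam * vi) for bi, vi in zip(bottom, v)])

a, b, v = 0.3, -1.2, [1.0, 2.0, -0.5]
for lam in lemma2_eigenvalues(a, b, v):
    print(lam, eigen_residual(a, b, v, lam))
```

Both residuals should be at the level of rounding error, for the positive and the negative root alike.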



Now, we are ready to perform the convexity analysis for local solution set
characterization. The FNN learning problem addressed is stated more precisely as:

Problem 1 The problem of FNN learning is defined by the $\ell_2$-norm minimization
objective given by (14), where $y(x_i, w)$ is an $(N_i$-$N_j$-1) network defined by
(1) with sigmoidal activation functions given by (2). The Jacobian of $y(x_i, w)$,
$i = 1, 2, \dots, n$, is denoted by $J^T(w) = [\nabla y(x_1, w), \nabla y(x_2, w), \dots, \nabla y(x_n, w)]$, where
each $\nabla y(x_i, w)$, $i = 1, 2, \dots, n$, is evaluated using (4)-(9).

First we present a first-order necessary condition for network learning as
follows:

Proposition 1. Given Problem 1. Then, the least squares estimate $\hat{w}$ of $w =
[w_1, w_2, \dots, w_p]^T$, $p = (N_j + 1) + (N_i + 1)N_j$, in the sense of minimizing $s(w)$ of
(14) satisfies

$b\,\hat{w} - J^T(\hat{w})\left[y^* - y(x, \hat{w})\right] = 0$.   (21)

Proof: Minimizing $s(w)$ with respect to $w$ by setting its first derivative to zero
yields the normal equation given by (21), and $\hat{w}$ is the point satisfying (21). Hence
the proof. ■

The convexity result is presented as follows: 

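Theorem 1 Given Problem 1. Then $s(w)$ is convex on $w \in W_c$ if

$\sum_{i=1}^{n} \left[ \left(\ddot{g}(z_k^o)\,\lambda_m(P)\right)_i + \left(\dot{g}(z_k^o)\,\lambda_m(Q)\right)_i \right] e_i \le b$,   (22)

where

$\lambda_m(P) = \begin{cases} \sum_{j=1}^{p} r_j^2, & e_i \ge 0 \\ 0, & e_i < 0 \end{cases}, \quad i = 1, 2, \dots, n$,   (23)

$r = [r_1, r_2, \dots, r_p]^T$,   (24)

$\lambda_m(Q) = \begin{cases} \max_j \lambda_j, & e_i \ge 0 \\ \min_j \lambda_j, & e_i < 0 \end{cases}, \quad i = 1, 2, \dots, n, \quad j = 1, 2, \dots, N_j$,   (25)

$\lambda_j = \tfrac{1}{2}\left[\, \ddot{g}(z_j^h)\,w_j^o\,U \pm \sqrt{\left(\ddot{g}(z_j^h)\,w_j^o\,U\right)^2 + 4\,\dot{g}(z_j^h)^2\,U}\,\right], \quad j = 1, 2, \dots, N_j$,   (26)

$U = \sum_{i} u_i^2$,   (27)

with $\dot{g}(\cdot)$, $\ddot{g}(\cdot)$, $z_k^o$ and $z_j^h$ being given by (7) through (9) and (11). Moreover,
when the inequality in (22) is strict, strict convexity results.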

Proof: For convexity (strict convexity) of $s(w)$, we need its Hessian to be positive
semidefinite (definite). From (10), (17)-(18), we know that the matrices $J^TJ$, $R$ and $I$
are symmetric. Hence, for the Hessian $[J^TJ + R + bI]$ to be positive semidefinite
(definite), it is sufficient to have

$\lambda_{\min}(J^TJ) + \lambda_{\min}(R) + b \ge 0$,   (28)

by Weyl's theorem (see [10], p. 181). Substituting $\lambda_{\min}(J^TJ) = 0$, (10), (12) and
(13) into the above, and applying Weyl's theorem again, we further have

$b - \sum_{i=1}^{n} \lambda_m\left(\ddot{g}(z_k)P\right)_i e_i - \sum_{i=1}^{n} \lambda_m\left(\dot{g}(z_k)Q\right)_i e_i \ge 0$,   (29)

where $\lambda_m(\cdot) = \lambda_{\max}(\cdot)$ when $e_i \ge 0$ and $\lambda_m(\cdot) = \lambda_{\min}(\cdot)$ when $e_i < 0$. Denote
the trace of a matrix $A$ by $\mathrm{tr}(A)$. According to Lemma 1, we have for $e_i < 0$

$\lambda_m\left(\ddot{g}(z_k)P\right)_i = 0, \quad i = 1, 2, \dots, n$,   (30)

and for $e_i \ge 0$

$\lambda_m\left(\ddot{g}(z_k)P\right)_i = \left(\ddot{g}(z_k)\,\mathrm{tr}(rr^T)\right)_i = \left(\ddot{g}(z_k) \sum_{j} r_j^2\right)_i, \quad i = 1, 2, \dots, n$,   (31)

where $r$ is given in (5). By Lemma 2 with adaptation of $A \in \mathbb{R}^{n^2 \times n^2}$ to
$Q \in \mathbb{R}^{p \times p}$, $p = (N_j + 1) + (N_i + 1)N_j$, we have $2N_j$ non-zero eigenvalues,
which are identified as in (26) and (27). This completes the proof. ■

Remark 1:- By exploiting the known canonical structure of the Hessian of
the FNN, Theorem 1 presents an explicit convexity condition using eigenvalue
characterization. The characterization is general since it can be applied to
non-sigmoidal activation functions by replacing $g$ and its derivatives with those of
other activation functions. Here, we note that the local optimal solution set can be
characterized by Proposition 1 and Theorem 1. These results will be incorporated into
the global optimality characterization in the following section. □



4 Global Optimality 

In this context, we refer to the solution of a minimization problem as: 

Definition 1 Let $f$ be a function to be minimized from $\mathcal{D}$ to $\mathbb{R}$, where $\mathcal{D} \subset \mathbb{R}^p$
is non-empty and compact. A point $\theta \in \mathcal{D}$ is called a feasible solution to the
minimization problem. If $\theta_g^* \in \mathcal{D}$ and $f(\theta) \ge f(\theta_g^*)$ for each $\theta \in \mathcal{D}$, then $\theta_g^*$ is
called a global optimal solution (global minimum) to the problem. If $\theta^* \in \mathcal{D}$ and
if there exists an $\varepsilon$-neighborhood $N_\varepsilon(\theta^*)$ around $\theta^*$ such that $f(\theta) \ge f(\theta^*)$ for
each $\theta \in \mathcal{D} \cap N_\varepsilon(\theta^*)$, then $\theta^*$ is called a local optimal solution (local minimum).
The set which contains both local optimal solutions and global optimal solutions
is called a solution set (denoted by $\Theta^*$).



4.1 Mathematical construct 

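Denote $\mathbb{R}^+ = (0, \infty)$. Consider a strictly decreasing transformation $\phi$ on the
function to be minimized. We shall use the following result (see [20] for more
details) for global optimality characterization of an FNN error function.

Proposition 2. Let $f : \mathcal{D} \to \mathbb{R}$ be a continuous function where $\mathcal{D} \subset \mathbb{R}^p$ is
compact. Let $\phi : \mathbb{R} \to \mathbb{R}^+$ be a strictly decreasing function. Suppose $\theta^* \in \mathcal{D}$.
Then $\theta^*$ is a global minimizer of $f$ if and only if

$\lim_{\gamma \to \infty} \dfrac{\phi^\gamma(f(\theta))}{\phi^\gamma(f(\theta^*))} = 0, \quad \forall\, \theta \in \mathcal{D},\ f(\theta) \ne f(\theta^*)$.   (32)

Proof: This follows from the fact that $\phi(f(\theta))/\phi(f(\theta^*)) > 0$, so that
$\lim_{\gamma \to \infty} \phi^\gamma(f(\theta))/\phi^\gamma(f(\theta^*)) = 0$
is equivalent to $\phi(f(\theta))/\phi(f(\theta^*)) < 1$. ■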
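The mechanism behind Proposition 2 can be illustrated numerically: raising the ratio to a large power $\gamma$ drives it to zero exactly when the candidate value is the smaller one. The sketch assumes the illustrative choice $\phi(t) = e^{-t}$, which is strictly decreasing and positive but is not claimed to be the transformation used in the paper; the function values are made up for the demonstration.

```python
import math

def phi(t):
    """A strictly decreasing, positive transformation (illustrative choice)."""
    return math.exp(-t)

def ratio(f_theta, f_star, gamma):
    """phi^gamma(f(theta)) / phi^gamma(f(theta*))."""
    return (phi(f_theta) / phi(f_star)) ** gamma

f_star = 0.10    # hypothetical value f(theta*) at the candidate point
for gamma in (1, 10, 100, 1000):
    higher = ratio(0.15, f_star, gamma)   # f(theta) > f(theta*): ratio shrinks to 0
    lower = ratio(0.05, f_star, gamma)    # f(theta) < f(theta*): ratio blows up
    print(gamma, higher, lower)
```

If any point has a smaller function value than the candidate, the corresponding ratio grows without bound, so the limit test fails and the candidate is not a global minimizer.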

Considering the solution set given by Definition 1, and using a more structured
convex transformation $\phi$, the following proposition is a straightforward
consequence.

Proposition 3. Let $f : \mathcal{D} \to \mathbb{R}$ be a continuous function where $\mathcal{D} \subset \mathbb{R}^p$ is
compact. Let $\phi : \mathbb{R} \to \mathbb{R}^+$ be a strictly decreasing and convex function. Denote
by $\Theta^*$ the solution set given by Definition 1. Suppose $\theta^* \in \Theta^*$. Then $\theta^*$ is a
global minimizer of $f$ if and only if

$\lim_{\gamma \to \infty} \dfrac{\phi^\gamma(f(\theta))}{\phi^\gamma(f(\theta^*))} = 0, \quad \forall\, \theta \in \Theta^*,\ f(\theta) \ne f(\theta^*)$.   (33)



4.2 Global optimality of the modified FNN error function 

Consider a feedforward neural network (FNN) with the training objective given 
by (14). Let v(w) be a convex monotonic transformation function given by: 

v(w) = fd{s{w)) = w e 'RT , p > 0, 7 > 1. (34) 

Notice that v{w) e = (0, oo) for all finite values of p, 7 and s{w). If a lower 
bound or the value of global minimum of s{w) is known, then we can multiply 
s[w) by p = for scaling purpose: 

v{w) = ^ 

This means that the maximum value of $v(w)$ can be pivoted near or at 1, while
all other local minima can be “flattened” (relative to the global minima) using a
sufficiently high value of $\gamma$. For FNN training adopting an $\ell_2$-norm error objective,
the ultimate lower bound of $s(w)$ is zero. For a network with good approximation
capability, the global minimum value should be small, so this zero
bound provides a good natural scaling.

Noting that the FNN considered is a continuous function mapping on a compact
set of the weight space $w \in W$, the characterization of global optimality for the
FNN training problem is presented as follows:



Theorem 2 Consider Problem 1. Denote by W* the solution set which satisfies 
Proposition 1 and strict convexity in Theorem 1. Let v{s{w)) = p > 0. 

If there exists $\gamma_0 > 1$ such that

$\dfrac{v(s(w))}{v(s(w^*))} \le \dfrac{1}{2}, \quad s(w) \ne s(w^*)$,   (36)

for all $\gamma \ge \gamma_0$ and for all $w \in W^*$, then $w^*$ is a global minimizer of $s(w)$.

Proof: The solution set $W^*$ which satisfies Proposition 1 and strict convexity
in Theorem 1 defines sufficiency for local optimality of the FNN error function
(14). Let $w \in W^*$ where $s(w) \ne s(w^*)$. By the hypothesis in the theorem, there
exists $\gamma_0 > 1$ such that (36) holds for all $\gamma \ge \gamma_0$. Thus $v(s(w))/v(s(w^*)) \le \tfrac{1}{2} < 1$ for
$\gamma \ge \gamma_0$. Passing to the limit, we have $\lim_{\gamma \to \infty} v(s(w))/v(s(w^*)) = 0$. By Proposition 3, $w^*$ is
a global minimizer. ■



Remark 2:- Theorem 2 shows that if we can find a $\gamma \ge \gamma_0$ such that (36) is
satisfied, then a level $\zeta > 0$ can be found to segregate the global minima from all
other local minima on the transformed error function $v$. For FNN training problems
adopting an $\ell_2$-norm error function $s_1$, the global optimal value is expected
to approach zero if network approximation capability is assumed (i.e. with
sufficient layers and neurons). Hence, we can simply choose a cutting level $\zeta$ to be
slightly less than 1 (since $v(0) = 1$ when $\rho = 1$) on a transformed function $v$ with a
sufficiently high value of $\gamma$. We shall utilize this observation for FNN training in
both regression and classification problems. □



5 Network Training 

5.1 Global descent search 

In this section, we show that the results of the global optimality characterization
can be directly applied to the network training problem. Here, we treat network
training as a nonlinear minimization problem. To achieve global optimality, the
minimization is subjected to the optimality conditions defined by Theorem 2.
Mathematically, the network training can be written as:

$\min_{w}\ s(w) \quad \text{subject to} \quad w \in W^*$,   (37)

where $W^*$ defines the solution set containing the global minima according to
Theorem 2. Here, the condition given by Proposition 1 is used for the design of the
iterative search direction (e.g. Gauss-Newton search). While the convex characterization
can be used for verification purposes, the global condition in Theorem 2 is used
as a constraint to search for $w$ over a high cutting level on the transformed
function:

$v(s(w)) \ge \zeta \quad \text{or} \quad h(w) = \zeta - v(s(w)) \le 0$,   (38)

where $\zeta$ can be chosen to be any value within $(\tfrac{1}{2}, 1)$.



5.2 Penalty function method 

Suppose there are $l$ constraint functions² which are put into the following vector
form: $h(w) = [h_1(w), h_2(w), \dots, h_l(w)]^T \le 0$. Let $\bar{h}_j(w) = \max\{0, h_j(w)\}$,
$j = 1, 2, \dots, l$, and define for $j = 1, 2, \dots, l$,

$\nabla \bar{h}_j(w) = \begin{cases} \nabla h_j(w) & \text{if } h_j(w) \ge 0 \\ 0 & \text{if } h_j(w) < 0 \end{cases}$.   (39)

Using the more compact matrix notation, (39) can be packed for $j = 1, 2, \dots, l$
as $H^T(w) = \nabla \bar{h}(w) = [\nabla \bar{h}_1(w), \nabla \bar{h}_2(w), \dots, \nabla \bar{h}_l(w)] \in \mathbb{R}^{p \times l}$. By the penalty
function method [13], the constrained minimization problem of (37) can be
rewritten as:

$(P_c): \quad \min_{w}\ q(c, w) = s(w) + c\,P(w)$,   (40)

where

$P(w) = \sum_{j=1}^{l} \psi(\bar{h}_j(w)) = \sum_{j=1}^{l} \left(\bar{h}_j(w)\right)^2$,   (41)

$\bar{h}_j(w) = \max\{0, h_j(w)\}, \quad j = 1, 2, \dots, l$,   (42)

and $c > 0$ is a large penalty coefficient.

² Apart from the constraint arising from (36), the boundaries of the domain of interest
can also be included as constraints.

In [13], it has been shown that as $c \to \infty$, the solution to the minimization
problem $(P_c)$ given by (40) will converge to a solution of the original constrained
problem given by (37). In the sequel, we shall concentrate on finding the solution
of $(P_c)$ (40)-(42).
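The behaviour of the penalized problem $(P_c)$ as $c$ grows can be seen on a toy example that is not taken from the paper: minimize $s(w) = w^2$ subject to $h(w) = 1 - w \le 0$. For this problem the penalized minimizer is $w_c = c/(1+c)$, which approaches the constrained solution $w = 1$ from the infeasible side. The helper names and the ternary search are illustrative choices only.

```python
def penalty(h_values):
    """P(w) = sum_j max(0, h_j(w))^2, as in (41)-(42)."""
    return sum(max(0.0, h) ** 2 for h in h_values)

def q(c, w, s, constraints):
    """Penalised objective q(c, w) = s(w) + c P(w) of (40)."""
    return s(w) + c * penalty([h(w) for h in constraints])

def minimise_1d(f, lo, hi, iters=200):
    """Ternary search for the minimiser of a convex function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) < f(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)

# Toy problem (not from the paper): minimise w^2 subject to 1 - w <= 0.
s = lambda w: w * w
constraints = [lambda w: 1.0 - w]

for c in (1.0, 10.0, 100.0, 1000.0):
    w_c = minimise_1d(lambda w: q(c, w, s, constraints), 0.0, 2.0)
    print(c, w_c)    # moves towards the constrained solution w = 1 as c grows
```

The example also shows why a large $c$ is needed in practice: for any finite $c$ the penalized solution slightly violates the constraint.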

For the unconstrained objective function given by $s(w)$ and the penalty function
given by $P(w)$ in (40), we note that their first- and second-order partial
derivatives are written as:

$\nabla s(w) = -2J^T e + 2bw, \qquad \nabla^2 s(w) = 2(J^TJ + R_y) + 2bI$,

$\nabla P(w) = 2H^T \bar{h}, \qquad \nabla^2 P(w) = 2(H^TH + R_h)$.

The functional dependency on $w$ is omitted here (and in the subsequent derivations)
when clarity is not affected. The following algorithms are derived to
search for the global minima.

5.3 Algorithm 

It can be further assumed that the first-order partial derivatives of $P(w)$ are
continuous, including at those points where $h_j(w) = 0$, $j = 1, 2, \dots, l$, following [13].
Hence, by taking the quadratic approximation of $q(w)$ in (40) about $w_0$ and setting
the first derivative to zero, we have

$w = w_0 + \left[J^TJ + R_y + bI + c\,(H^TH + R_h)\right]^{-1}\left(J^T e - b\,w_0 - c\,H^T \bar{h}\right)$.   (43)

If we drop the second-order partial derivatives of the network ($R_y$) and the
second-order partial derivatives of the constraint function ($R_h$), we can formulate a
search algorithm as follows:

$w_{i+1} = w_i + \left[J_i^TJ_i + bI + c\,H_i^TH_i\right]^{-1}\left(J_i^T e_i - b\,w_i - c\,H_i^T \bar{h}_i\right)$.   (44)

By including a weighted parameter norm in the error objective function (14),
a weighted identity matrix ($bI$) appears in the term of (44) which requires
matrix inversion. This provides a mechanism that protects the search from
ill-conditioning, analogous to that of the Levenberg-Marquardt method.

To further improve numerical properties, the widely distributed eigenvalues
of the penalty term can be normalized as shown:

$w_{i+1} = w_i + \beta\left[J_i^TJ_i + bI + c\,H_i^T \Lambda H_i\right]^{-1}\left(J_i^T e_i - b\,w_i - c\,H_i^T \bar{h}_i\right)$,   (45)

where $\Lambda = (H_iH_i^T)^{-1}$. Here, we use the Line Search procedure (with $\beta$ chosen
to minimize the objective function) for iterative search.

In the following, we shall compare the global descent algorithm given by (45)
(denoted as LSGO) with its local counterpart, the local line search (denoted as
LS), obtained by setting $c = 0$ and $bw = 0$ in (45).
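To make the structure of the update concrete, the following sketch implements the local part of (45) only, i.e. with $c = 0$ so that all penalty terms vanish, while keeping the weight-decay term $bw$. It is a minimal illustration under stated assumptions, not the authors' implementation: a toy two-parameter model $y = g(w_0 + w_1 x)$ stands in for the FNN, the $2 \times 2$ system is solved in closed form, and simple step halving replaces the exact line search on $\beta$.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def model(w, x):
    """Toy two-parameter model y = g(w0 + w1 x); stands in for the FNN."""
    return sigmoid(w[0] + w[1] * x)

def objective(w, xs, ts, b):
    """s(w) = e^T e + b w^T w."""
    fit = sum((t - model(w, x)) ** 2 for x, t in zip(xs, ts))
    return fit + b * (w[0] ** 2 + w[1] ** 2)

def search_direction(w, xs, ts, b):
    """d = [J^T J + b I]^{-1} (J^T e - b w), i.e. (45) with c = 0."""
    a11 = a12 = a22 = r1 = r2 = 0.0
    for x, t in zip(xs, ts):
        y = model(w, x)
        gd = y * (1.0 - y)             # derivative of the logistic function
        j0, j1 = gd, gd * x            # Jacobian row of y w.r.t. (w0, w1)
        e = t - y
        a11 += j0 * j0
        a12 += j0 * j1
        a22 += j1 * j1
        r1 += j0 * e
        r2 += j1 * e
    a11 += b
    a22 += b
    r1 -= b * w[0]
    r2 -= b * w[1]
    det = a11 * a22 - a12 * a12        # positive because b > 0
    return [(a22 * r1 - a12 * r2) / det, (a11 * r2 - a12 * r1) / det]

def train(w, xs, ts, b=1e-4, iters=50):
    """Iterate the damped Gauss-Newton step with step halving on beta."""
    for _ in range(iters):
        d = search_direction(w, xs, ts, b)
        beta, current = 1.0, objective(w, xs, ts, b)
        while beta > 1e-8:             # backtracking stands in for the line search
            trial = [w[0] + beta * d[0], w[1] + beta * d[1]]
            if objective(trial, xs, ts, b) < current:
                w = trial
                break
            beta *= 0.5
    return w

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ts = [model([0.5, -1.5], x) for x in xs]   # synthetic targets from known weights
w_fit = train([0.0, 0.0], xs, ts)
print(w_fit, objective(w_fit, xs, ts, 1e-4))
```

Because $J^TJ + bI$ is positive definite for $b > 0$, the computed direction is a descent direction for $s(w)$, so the step-halving loop can only decrease the objective. The LSGO variant would add the terms $c\,H_i^T \Lambda H_i$ and $c\,H_i^T \bar{h}_i$ arising from the constraint (38).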

6 Numerical Experiments 

6.1 Benchmark problems 

For the experiments in this subsection, the FNN learning objective is
chosen to be $s_1(w)$ rather than $s(w)$, since only training accuracy for regression
is needed. Here the $bw$ term in algorithm (45) is set to zero while the $bI$ term is
retained for numerical stability. For all examples, 100 trials using random initial
points within the box $[0,1]^p$ were carried out. Training results are presented in terms of the
number of trials reaching a neighborhood of the desired global minimum and the
mean number of iterations for these trials.

As the number of iterations required to reach a desired error goal provides only
a partial picture of the training algorithm, numerical results on computational
aspects are also provided. All the experiments were conducted using an
IBM-PC/Pentium compatible machine with a 266 MHz clock speed. In the following,
we tabulate the average CPU time required to run 10 iterations for each
algorithm.

For ease of comparison, recent results (mean number of iterations, percent- 
age of trials attaining near global solution) for the XOR problem from [22] are 
listed: (i) Standard error backpropagation: (332, 91.3%); (ii) Error backpropaga- 
tion with line minimization: (915.4, 38%); (iii) Davidon-Fletcher-Powell quasi- 
Newton minimization: (2141.1, 34.1%); (iv) Fletcher-Reeves conjugate gradient 
minimization: (523, 81.5%); (v) Conjugate gradient minimization with Powell
restarts: (79.2, 82.1%). The reader is referred to [22] for more results on the
$f(x) = \sin(x)\cos(2x)$ fitting problem, which requires many more than 600 mean
iterations for the above search methods.

As for existing global optimization algorithms, similar statistical compar- 
isons for these examples are not available. We note that the particular training 
example for XOR given in [8] (TRUST) used about 1000 training iterations to 
reach the global optimal solution. As for the global algorithm proposed by [19] 
(COTA), the convergence speed is reported to be comparable to the backprop- 
agation algorithm for the XOR example. 



Example 1: XOR pattern 

In this example, 4 samples of the XOR input-output patterns were used for
network training. The network chosen was similar to that in [22], where 2 hidden
units were used. The target sum of squared errors was set to be less than 0.025,
which was sufficiently close to the global optimal solution. For both local and
global methods, $b$ was chosen to be a fixed value of 0.0001, which was sufficient
to provide stable numerical conditioning. As for the other parameters of the global
descent algorithm (LSGO), the following settings were used: $\gamma = 4$, $\rho = 1$,
$\zeta = 0.9999$, $c = 10$ ($\gamma$ and $\rho$ appear in $v(w)$, $\zeta$ is the cutting level on $v(w)$ as
shown in (38), and $c$ is a penalty constant).

Training results comprising 100 trials for each of the algorithms are shown
in Table 1. The respective statistics (i.e. min: minimum value, max: maximum
value, mean: mean value, std dev: standard deviation, and GOP: percentage
of trials achieving the desired global error objective within 500 iterations) are
also included in Table 1. In order to show the core distribution for trials which
took fewer than 500 iterations, the statistics shown exclude those trials above 500
iterations.

Table 1. Results for Examples 1, 2 and 3

                        CPU      Number of iterations                GOP
Ex.   Case   Algo.     (sec)     min    max       mean    std dev.   (%)
1     (a)    LS         0.66       5    > 500     13.73    22.46      33
1     (b)    LSGO       0.77      28    132       39.40    15.76     100
2     (a)    LS         0.93      21    > 5000    94.70    94.76      50
2     (b)    LSGO       1.15      29    3191      72.34    26.78     100
3     (a)    LS         5.50       -    > 500         -        -       0
3     (b)    LSGO       8.41      49    > 500     73.16    18.85      99


As shown in Table 1, the global descent algorithm (LSGO) succeeded
in locating the approximate global minima within 132 iterations for all 100
trials using different initial values. Compared with the local line search
algorithm (LS), which scores only 33%, this is a remarkable improvement. In terms
of computational cost, the global constrained method is found to take slightly
more CPU time than the unconstrained method.



Example 2: 1-D curve $f(x) = \sin(x)\cos(2x)$

In this example, 20 input-output patterns were uniformly chosen on $0 \le x \le 2\pi$
for network training. Similar to the first example, the sum of squared errors
was set to be less than 0.025, which was sufficiently close to the global optimal
solution. A single-output network with 10 hidden units was chosen according
to [22]. As in the previous example, $b$ was set to 0.0001 throughout. For the
global descent algorithm (LSGO), the following settings were chosen: $\gamma = 4$,
$\rho = 10$, $\zeta = 9.9999$, $c = 10$. Training results for 100 trials are shown in Table 1,
with the respective statistics and CPU times. For this example, GOP refers to the
percentage of trials achieving the desired error goal within 5000 iterations.

In this example, the global method (LSGO) achieved a 100% GOP, which
is much better than the 50% of the local method (LS). The largest iteration number
for LSGO was found to be 3191. As for the CPU time, the
global constrained algorithm takes longer than its local counterpart in each
iteration.

Example 3: 2-D shape

For this example, the network is to learn a two-dimensional sine function. The
network size chosen was (2-15-1) and the error goal was set at 0.8 with 289
input-output training sets. Similar to the above examples, $b$ was chosen to be 0.0001
throughout. For the global descent algorithm (LSGO), the following settings
were chosen: $\gamma = 2$, $\rho = 10$, $\zeta = 9.999999$, $c = 1000$. The training results
for 100 trials are shown in Table 1. Here we note that the GOP for this example
indicates the percentage of trials reaching the error goal within 500 iterations.

From Table 1, we see that for all 100 trials, the LS method was unable to
descend towards the error goal within 500 iterations. In fact, we observed that
most of these trials had landed on local minima which are much higher than the
error goal. The LSGO improved the situation, with only one trial resulting
in an SSE slightly greater than the error goal at the end of the 500th iteration.

Remark 3:- Despite the remarkable convergence using random initial estimates
for all the examples, it is noted that when the initial point was chosen at some of
those local minima, the penalty-based algorithms converged extremely
slowly and were unable to locate the global minima within 5000 iterations. □



6.2 Face recognition 

Face recognition represents a difficult classification problem since it has to deal
with a large amount of variation in face images due to viewpoint, illumination
and expression differences, even for the same person. As such, many recognition
conditions are ill-posed because “the variations between the images of the same
face due to illumination and viewing direction are almost always larger than
image variation due to change in face identity” [15].

Here we use the ORL Cambridge database [18] for classification. The ORL 
database contains 40 distinct persons, each having 10 different images taken 
under different conditions: different times, varying lighting (slightly), facial ex- 
pression (open/closed eyes, smiling/non-smiling) and facial details (glasses/no- 
glasses). All the images are taken against a dark homogeneous background and 
all persons are in up-right, frontal position except for some tolerance in side 
movement. The face recognition procedure is performed in two stages, namely: 

1. Feature extraction: the training set and query set are derived in the same way
as in [12], where 10 images of each of the 40 persons are randomly partitioned
into two sets, resulting in 200 images for training and 200 images for testing
with no overlap between them. Each original image is then projected
onto the feature spaces derived from the Eigenface [21], Fisherface [3] and
D-LDA [23] methods.

2. Classification: conventional nearest centre (NC), nearest neighbour (NN)
and our proposed FNN method are used for classification. The error rates
are taken only from the averages of test errors obtained from 8 different
runs.³



The network chosen for this application consists of 40 separate FNNs (one
network per person), each with an $(N_i$-1-1) structure, considering network dilution
for good generalization [11]. The number of inputs $N_i = 39$ is set according to
the feature dimension obtained from projections using Eigenface, Fisherface and
D-LDA. During the training phase, each of the 40 FNN outputs was set to ‘1’
for its corresponding class (person) while the others were set to ‘0’. The global FNN
was tested for 5 cases, FNN(a)-(e), each for 500 learning iterations, all using
$\gamma = 4$, $\rho = 10$, $\zeta = 9.9999$. Among the 40 individual outputs of the FNNs, the
output with the highest value is assigned to the corresponding class. Training
and testing error rates were obtained from the number of misclassified persons
divided by 200. The results corresponding to the different FNN settings and
various feature extraction methods are shown in Table 2.



Table 2. FNN settings and corresponding error rates

Classification   FNN settings                   Classification error rate (%)
method           w_0        c       b           Eigenface        D-LDA            Fisherface
                                                Train   Test     Train   Test     Train    Test
NC               -          -       -           -       12.25    -       5.56     -        8.12
NN               -          -       -           -        6.50    -       5.38     -        8.81
FNN(a)           k × 10^“   0       10"*        3.69    25.88    34.44   45.31    69.25*   79.25*
FNN(b)           k × 10'“   10’^    10~“        3.56    26.75    34.25   47.06    69.25*   79.38*
FNN(c)           k × 10'“   10’^    0.05        0        5.38    0        5.62    0        12.56
FNN(d)           k × 10'*   10*     0.25        0        5.06    0        5.31    0         7.81
FNN(e)           R × 10'*   10*     0.05-0.5    0        3.88    0        4.63    0         7.81

* : encounter singularity of matrix for some cases during training.



From Table 2, we see that poor results were obtained for both the conventional
FNN (FNN(a): unconstrained minimization on $s_1$ with $c = 0$) and
the global descent FNN for regression (FNN(b): constrained minimization on
$s_1$ with the nonzero penalty constant $c$ of Table 2). However, remarkable improvement was observed when the
weight $b$ was set at 0.05, as seen in FNN(c). With smaller initial values ($k$
indicates $k = 1, \dots, p$ offset by its mean) and a higher $b$, the results were seen
to improve further in FNN(d). In FNN(e), we provide our best achievable results
from random initial estimates ($R$, scaled as listed in Table 2) and variations of the $b$ value. We see
that our simple training method can provide good generalization capability as
compared to the best known network-based method, which incorporated several ideas:
local receptive fields, shared weights, and spatial subsampling [12]. In short, for
this case of using half the data set for training, our best FNN provides a good
error rate as compared to the best known methods reported in [12]: Top-down HMM
(13%), Eigenfaces (10.5%), Pseudo 2D-HMM (5%) and SOM+CN (3.8%).

³ Here we note that only 3 runs were tested in [12].

7 Conclusion

In this paper, we propose to train a regularized FNN using a global search
method. We have presented explicit vector and matrix canonical forms for the
Jacobian and the Hessian of the FNN prior to convexity analysis of the weighted
$\ell_2$-norm error function. The sufficient conditions for such convex characterization
are derived. This permits direct means to analyze the network in aspects of
network training and possibly network pruning. Results from the convex
characterization are utilized in an attempt to characterize the global optimality of
the FNN error function, which is suitable for regression and classification
applications. By means of convex monotonic transformation, a sufficient condition for
the FNN training to attain global optimality is proposed. The theoretical results
are applied directly to network training using a simple constrained search. Several
numerical examples show remarkable improvement in terms of convergence
of our network training as compared to available local methods. The network
learning is also shown to possess good generalization properties in a face recognition
problem. It is our immediate task to generalize these results to a network
with multiple outputs. Design of a more robust constrained search remains an
issue in order to guarantee global convergence.

Abstract. It is well known that the problem of matching two relational
structures can be posed as an equivalent problem of finding a maximal 
clique in a (derived) association graph. However, it is not clear how to 
apply this approach to computer vision problems where the graphs are 
connected and acyclic, i.e. are free trees, since maximal cliques are not 
constrained to preserve connectedness. Motivated by our recent work on 
rooted tree matching, in this paper we provide a solution to the problem 
of matching two free trees by constructing an association graph whose 
maximal cliques are in one-to-one correspondence with maximal subtree 
isomorphisms. We then solve the problem using simple payoff-monotonic 
dynamics from evolutionary game theory. We illustrate the power of the 
approach by matching articulated and deformed shapes described by 
shape-axis trees. Experiments on hundreds of larger (random) trees are 
also presented. The results are impressive: despite the inherent inabil- 
ity of these simple dynamics to escape from local optima, they always 
returned a globally optimal solution. 



1 Introduction 

Graph matching is a classic problem in computer vision and pattern recogni- 
tion, instances of which arise in areas as diverse as object recognition, motion 
and stereo analysis (see, e.g., [2]). A well-known approach to solving this prob- 
lem consists of transforming it into the equivalent problem of finding a maximum 
clique in an auxiliary graph structure, known as the association graph [2]. The 
idea goes back to Ambler et al. [1], and has since been successfully employed in a 
variety of different problems. This framework is attractive because it casts graph 
matching as a pure graph-theoretic problem, for which a solid theory and powerful
algorithms have been developed. Note that, although the maximum clique
problem is known to be NP-hard, powerful heuristics exist which efficiently find
good approximate solutions [6].

In many computer vision and pattern recognition problems, the graphs at
hand have a peculiar structure: they are connected and acyclic, i.e. they are free
trees [4,11,19,32]. Note that, unlike “rooted” trees, in free trees there is no
distinguished node playing the role of the root, and hence no hierarchy is imposed on
them. Since in the standard association graph formulation the solutions are not
constrained to preserve connectedness, it is not clear how to apply the framework
in these cases, and the extension of association graph techniques to free
tree matching problems is therefore of considerable interest.

Motivated by our recent work on rooted tree matching [27], in this paper we 
propose a solution to this problem by providing a straightforward way of deriving 
an association graph from two free trees. We prove that in the new formulation 
there is a one-to-one correspondence between maximal (maximum) cliques in 
the derived association graph and maximal (maximum) subtree isomorphisms. 
As an obvious corollary, the computational complexity of finding a maximum 
clique in such graphs is therefore the same as the subtree isomorphism problem, 
which is known to be polynomial in the number of nodes [13,29]. 

Following [25,27], we use a recent generalization of the Motzkin-Straus theo- 
rem [23] to formulate the maximum clique problem as a quadratic programming 
problem. To (approximately) solve it we employ payoff-monotonic dynamics, 
a class of simple dynamical systems recently developed and studied in evolu- 
tionary game theory [17,30]. Such continuous solutions to discrete problems are 
interesting as they can motivate analog and biological implementations. 
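As a rough illustration of such dynamics, the sketch below runs the discrete-time replicator equation, one well-known member of the payoff-monotonic family, on a small graph. It is a sketch under stated assumptions rather than the formulation used in this paper: the payoff matrix is taken to be the adjacency matrix plus one half of the identity, a regularization commonly used with the Motzkin-Straus program, and the iteration count, threshold and toy graph are illustrative choices.

```python
def replicator_clique(adj, iters=1000, tol=1e-3):
    """Discrete-time replicator dynamics on A = adjacency + I/2.

    Returns the support of the final state, i.e. the nodes whose share
    stays above `tol`, which is read off as a clique.
    """
    n = len(adj)
    a = [[adj[i][j] + (0.5 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    x = [1.0 / n] * n                      # start from the barycentre of the simplex
    for _ in range(iters):
        ax = [sum(a[i][j] * x[j] for j in range(n)) for i in range(n)]
        total = sum(x[i] * ax[i] for i in range(n))
        x = [x[i] * ax[i] / total for i in range(n)]
    return sorted(i for i in range(n) if x[i] > tol)

# Toy graph: triangle {0, 1, 2} with a pendant node 3 attached to node 2.
adj = [[0, 1, 1, 0],
       [1, 0, 1, 0],
       [1, 1, 0, 1],
       [0, 0, 1, 0]]
print(replicator_clique(adj))
```

On this graph the share of the pendant node decays geometrically, and the state settles on the triangle. In general such dynamics converge to a maximal clique, which need not be a maximum one.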

We illustrate the power of the approach via several examples of matching 
articulated and deformed shapes described by shape-axis trees [19]. We also 
present experiments on hundreds of much larger (uniformly random) trees. The 
results are impressive: despite the counter-intuitive maximum clique formulation 
of the tree matching problem, and the inherent inability of these simple dynamics 
to escape from local optima, they always found a globally optimal solution. 

2 Subtree Isomorphisms and Maximal Cliques 

Let $G = (V, E)$ be a graph, where $V$ is the set of nodes and $E$ is the set of
(undirected) edges. The order of $G$ is the number of nodes in $V$, while its size
is the number of edges. Two nodes $u, v \in V$ are said to be adjacent (denoted
$u \sim v$) if they are connected by an edge. The adjacency matrix of $G$ is the
$n \times n$ symmetric matrix $A_G = (a_{ij})$ defined as

$a_{ij} = \begin{cases} 1, & \text{if } u_i \sim u_j \\ 0, & \text{otherwise.} \end{cases}$

The degree of a node $u$, denoted $\deg(u)$, is the number of nodes adjacent to it.
A path is any sequence of distinct nodes $u_0 u_1 \dots u_n$ such that for all $i = 1 \dots n$,
$u_{i-1} \sim u_i$; in this case, the length of the path is $n$. If $u_n \sim u_0$ the path is called
a cycle. A graph is said to be connected if any two nodes are joined by a path.
The distance between two nodes $u$ and $v$, denoted by $d(u, v)$, is the length of the
shortest path joining them (by convention $d(u, v) = \infty$ if there is no such path).
Given a subset of nodes $C \subseteq V$, the induced subgraph $G[C]$ is the graph having
$C$ as its node set, and two nodes are adjacent in $G[C]$ if and only if they are
adjacent in $G$. A connected graph with no cycles is called a free tree, or simply
a tree. Trees have a number of interesting properties. One which turns out to be
very useful for our characterization is that in a tree any two nodes are connected
by a unique path.

Let $T_1 = (V_1, E_1)$ and $T_2 = (V_2, E_2)$ be two trees. Any bijection $\phi : H_1 \to H_2$,
with $H_1 \subseteq V_1$ and $H_2 \subseteq V_2$, is called a subtree isomorphism if it preserves both
the adjacency relationships between the nodes and the connectedness of the
matched subgraphs. Formally, this means that, given $u, v \in H_1$, we have $u \sim v$
if and only if $\phi(u) \sim \phi(v)$ and, in addition, the induced subgraphs $T_1[H_1]$ and
$T_2[H_2]$ are connected. A subtree isomorphism is maximal if there is no other
subtree isomorphism $\phi' : H_1' \to H_2'$ with $H_1$ a strict subset of $H_1'$, and maximum
if $H_1$ has largest cardinality. The maximal (maximum) subtree isomorphism
problem is to find a maximal (maximum) subtree isomorphism between two
trees.

The free tree association graph (FTAG) of two trees $T_1 = (V_1, E_1)$ and
$T_2 = (V_2, E_2)$ is the graph $G = (V, E)$ where

$$V = V_1 \times V_2 \qquad (1)$$

and, for any two nodes $(u, w)$ and $(v, z)$ in $V$, we have

$$(u, w) \sim (v, z) \Leftrightarrow d(u, v) = d(w, z) . \qquad (2)$$

Note that this definition of the association graph is stronger than the standard 
one used for matching arbitrary relational structures [2]. 

A subset of vertices of G is said to be a clique if all its nodes are mutually 
adjacent. A maximal clique is one which is not contained in any larger clique, 
while a maximum clique is a clique having largest cardinality. The maximum 
clique problem is to find a maximum clique of G. 

The following theorem, which is the basis of the work reported here,
establishes a one-to-one correspondence between the maximum subtree isomorphism
problem and the maximum clique problem.

Theorem 1 . Any maximal (maximum) subtree isomorphism between two trees 
induces a maximal (maximum) clique in the corresponding FTAG, and vice 
versa. 

Proof (outline). Let $\phi : H_1 \to H_2$ be a maximal subtree isomorphism between
trees $T_1$ and $T_2$, and let $G = (V, E)$ denote the corresponding FTAG. Let
$C_\phi \subseteq V$ be defined as $C_\phi = \{(u, \phi(u)) : u \in H_1\}$. From
the definition of a subtree isomorphism it follows that $\phi$ maps the path
between any two nodes $u, v \in H_1$ onto the path joining $\phi(u)$ and
$\phi(v)$. This clearly implies that $d(u, v) = d(\phi(u), \phi(v))$ for all
$u, v \in H_1$, and therefore $C_\phi$ is a clique. Trivially, $C_\phi$ is a
maximal clique because $\phi$ is maximal, and this proves the first part of the
theorem.

Suppose now that $C = \{(u_1, w_1), \dots, (u_n, w_n)\}$ is a maximal clique of
$G$, and let $H_1 = \{u_1, \dots, u_n\} \subseteq V_1$ and
$H_2 = \{w_1, \dots, w_n\} \subseteq V_2$. Define $\phi : H_1 \to H_2$ as
$\phi(u_i) = w_i$, for all $i = 1 \dots n$. From the definition of the FTAG and
the hypothesis that $C$ is a clique, it is simple to see that $\phi$ is a
one-to-one and onto correspondence between $H_1$ and $H_2$, which trivially
preserves the adjacency relationships between nodes. The fact that $\phi$ is a
maximal isomorphism is a straightforward consequence of the maximality of $C$.

To conclude the proof we have to show that the subgraphs that we obtain
when we restrict ourselves to $H_1$ and $H_2$, i.e. $T_1[H_1]$ and $T_2[H_2]$,
are trees, and this is equivalent to showing that they are connected. Suppose by
contradiction that this is not the case, and let $u_i, u_j \in H_1$ be two nodes
which are not joined by a path in $T_1[H_1]$. Since both $u_i$ and $u_j$ are
nodes of $T_1$, however, there must exist a path $u_i = x_0 x_1 \dots x_m = u_j$
joining them in $T_1$. Let $x^* = x_k$, for some $k = 1 \dots m$, be a node on
this path which is not in $H_1$. Moreover, let $y^* = y_k$ be the $k$-th node on
the path $w_i = y_0 y_1 \dots y_m = w_j$ which joins $w_i$ and $w_j$ in $T_2$
(remember that $d(u_i, u_j) = d(w_i, w_j)$, and hence $d(w_i, w_j) = m$). It is
easy to show that the set $\{(x^*, y^*)\} \cup C \subseteq V$ is a clique,
thereby contradicting the hypothesis that $C$ is a maximal clique. This can be
proved by exploiting the obvious fact that if $x$ is a node on the path joining
any two nodes $u$ and $v$, then $d(u, v) = d(u, x) + d(x, v)$.

The “maximum” part of the statement is proved similarly. □ 

The FTAG is readily derived by using a classical representation for graphs,
i.e., the so-called distance matrix (see, e.g., [15]) which, for an arbitrary
graph $G = (V, E)$ of order $n$, is the $n \times n$ matrix $D = (d_{ij})$ where
$d_{ij} = d(u_i, u_j)$, the distance between nodes $u_i$ and $u_j$. Efficient,
classical algorithms are available for obtaining such a matrix [10]. Note also
that the distance matrix of a graph can easily be constructed from its adjacency
matrix $A_G$. In fact, denoting by $a_{ij}^{(n)}$ the $(i, j)$-th entry of the
matrix $A_G^n$, the $n$-th power of $A_G$, we have that $d_{ij}$ equals the least
$n$ for which $a_{ij}^{(n)} > 0$ (there must be such an $n$ since a tree is
connected).
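
The construction of the FTAG from two distance matrices can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the experiments: the function names `distance_matrix` and `build_ftag` and the adjacency-list input format are assumptions, distances are obtained by breadth-first search rather than by the matrix powers mentioned above, and self-loops are omitted from the association graph.

```python
from collections import deque
from itertools import product


def distance_matrix(adj):
    """All-pairs distances of an unweighted graph, one BFS per node.

    `adj` maps each node to an iterable of its neighbours. Unreachable
    pairs receive float('inf'), matching the convention d(u, v) = oo.
    """
    dist = {}
    for source in adj:
        seen = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen[v] = seen[u] + 1
                    queue.append(v)
        for target in adj:
            dist[source, target] = seen.get(target, float("inf"))
    return dist


def build_ftag(tree1, tree2):
    """Return (nodes, edges) of the free tree association graph.

    Nodes are all pairs (u, w) in V1 x V2, as in (1); two distinct nodes
    (u, w) and (v, z) are adjacent iff d(u, v) == d(w, z), as in (2).
    """
    d1, d2 = distance_matrix(tree1), distance_matrix(tree2)
    nodes = list(product(tree1, tree2))
    edges = set()
    for a, (u, w) in enumerate(nodes):
        for (v, z) in nodes[a + 1:]:
            if d1[u, v] == d2[w, z]:
                edges.add(frozenset([(u, w), (v, z)]))
    return nodes, edges


# Two small trees: the path 0-1-2 and the single edge a-b.
path3 = {0: [1], 1: [0, 2], 2: [1]}
path2 = {"a": ["b"], "b": ["a"]}
nodes, edges = build_ftag(path3, path2)
```

In this toy case the only adjacent pairs are those matching an edge of the first tree onto the single edge of the second, so every maximal clique has two nodes and corresponds to a two-node subtree isomorphism.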

3 Continuous Formulation of Maximum Clique 

After formulating the free tree matching problem as a maximum clique problem,
we now proceed (following [27]) by mapping the latter onto a continuous
quadratic programming problem. Let $G = (V, E)$ be an arbitrary graph of order
$n$, and let $\Delta$ denote the standard simplex of $\mathbb{R}^n$:

$$\Delta = \{ x \in \mathbb{R}^n : e'x = 1 \text{ and } x_i \ge 0,\ i = 1 \dots n \}$$

where $e$ is the vector whose components equal 1, and a prime denotes
transposition. Given a subset of vertices $C$ of $G$, we will denote by $x^C$ its
characteristic vector which is the point in $\Delta$ defined as

$$x_i^C = \begin{cases} 1/|C|, & \text{if } i \in C \\ 0, & \text{otherwise} \end{cases}$$

where $|C|$ denotes the cardinality of $C$.

Now, consider the following quadratic function

$$f_G(x) = x' A_G x + \frac{1}{2} x'x \qquad (3)$$

where $A_G = (a_{ij})$ is the adjacency matrix of $G$. The following theorem,
recently proved by Bomze [5], expands on the Motzkin-Straus theorem [23], a
remarkable result which establishes a connection between the maximum clique
problem and certain standard quadratic programs. This has an intriguing
computational significance in that it allows us to shift from the discrete to
the continuous domain in an elegant manner.

Theorem 2. Let $C$ be a subset of vertices of a graph $G$, and let $x^C$ be its
characteristic vector. Then, $C$ is a maximal (maximum) clique of $G$ if and only
if $x^C$ is a local (global) maximizer of $f_G$ in $\Delta$. Moreover, all local
(and hence global) maximizers of $f_G$ in $\Delta$ are strict.

Unlike the original Motzkin-Straus formulation, which is plagued by the
presence of "spurious" solutions [26], the previous result guarantees that all
maximizers of $f_G$ on $\Delta$ are strict, and are characteristic vectors of
maximal/maximum cliques in the graph. In a formal sense, therefore, a one-to-one
correspondence exists between maximal cliques and local maximizers of $f_G$ in
$\Delta$ on the one hand, and maximum cliques and global maximizers on the other
hand.
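
To make the correspondence concrete, note that for a clique $C$ of size $k$ the characteristic vector gives $x' A_G x = 1 - 1/k$ and $\frac{1}{2} x'x = 1/(2k)$, so $f_G(x^C) = 1 - 1/(2k)$, which grows with the clique size. The following Python sketch checks this on a small graph; the helper names `f_G` and `characteristic_vector` and the example graph are assumptions made for illustration only.

```python
def f_G(adj_matrix, x):
    """Evaluate f_G(x) = x' A_G x + (1/2) x'x, as in (3)."""
    n = len(x)
    quad = sum(adj_matrix[i][j] * x[i] * x[j]
               for i in range(n) for j in range(n))
    return quad + 0.5 * sum(x_i * x_i for x_i in x)


def characteristic_vector(clique, n):
    """Point of the simplex with mass 1/|C| on each vertex of C."""
    return [1.0 / len(clique) if i in clique else 0.0 for i in range(n)]


# Example graph: a triangle {0, 1, 2} with a pendant vertex 3 attached to 2.
A_G = [[0, 1, 1, 0],
       [1, 0, 1, 0],
       [1, 1, 0, 1],
       [0, 0, 1, 0]]

value_triangle = f_G(A_G, characteristic_vector({0, 1, 2}, 4))  # 1 - 1/6
value_edge = f_G(A_G, characteristic_vector({2, 3}, 4))         # 1 - 1/4
```

Both $\{0, 1, 2\}$ and $\{2, 3\}$ are maximal cliques of this graph, but only the triangle is maximum, and it attains the larger value of $f_G$.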



4 Matching Free Trees with Monotone Game Dynamics 

Payoff-monotonic dynamics are a wide class of dynamical systems developed 
and studied in evolutionary game theory, a discipline pioneered by J. Maynard 
Smith [20] which aims to model the evolution of animal behavior using the 
principles and tools of noncooperative game theory. In this section we discuss 
the basic intuition behind these models and present a few theoretical properties 
that are instrumental for their application to our optimization problem. For a 
more systematic treatment see [17,30]. 

Consider a large, ideally infinite population of individuals belonging to the 
same species which compete for a particular limited resource, such as food, ter- 
ritory, etc. In evolutionary game theory, this kind of conflict is modeled as a 
symmetric two-player game, the players being pairs of randomly selected pop- 
ulation members. In contrast to traditional application fields of game theory, 
such as economics or sociology, players here do not behave “rationally,” but act 
instead according to a pre-programmed behavior pattern, or pure strategy. Re- 
production is assumed to be asexual, which means that, apart from mutation, 
offspring will inherit the same genetic material, and hence behavioral phenotype, 
as its parent. 

Let $J = \{1, \dots, n\}$ be the set of available pure strategies and, for all
$i \in J$, let $x_i(t)$ be the proportion of population members playing strategy
$i$, at time $t$. The state of the population at a given instant is the vector
$x = (x_1, \dots, x_n)'$. Clearly, population states are constrained to lie in
the standard simplex $\Delta$. For a given population state $x \in \Delta$, we
shall denote by $\sigma(x)$ the support of $x$, i.e. the set of non-extinct
strategies:

$$\sigma(x) = \{ i \in J : x_i > 0 \} .$$

Let $A = (a_{ij})$ be the $n \times n$ payoff (or utility) matrix. Specifically,
for each pair of strategies $i, j \in J$, $a_{ij}$ represents the payoff of an
individual playing strategy $i$ against an opponent playing strategy $j$. In
biological contexts a player's utility can simply be measured in terms of
Darwinian fitness or reproductive success, i.e., the player's expected number of
offspring. If the population is in state $x$, the expected payoff earned by an
$i$-strategist is:

$$\pi_i(x) = \sum_{j=1}^{n} a_{ij} x_j = (Ax)_i \qquad (4)$$

while the mean payoff over the entire population is

$$\pi(x) = \sum_{i=1}^{n} x_i \pi_i(x) = x'Ax . \qquad (5)$$

In evolutionary game theory the assumption is made that the game is played
over and over, generation after generation, and that the action of natural
selection will result in the evolution of the fittest strategies. If successive
generations blend into each other, the evolution of behavioral phenotypes can be
described by a set of ordinary differential equations. A general class of
evolution equations is given by:

$$\dot{x}_i = x_i g_i(x) \qquad (6)$$

where a dot signifies derivative with respect to time, and
$g = (g_1, \dots, g_n)$ is a function with open domain containing $\Delta$.
Here, the function $g_i$ ($i \in J$) specifies the rate at which pure strategy
$i$ replicates. It is usually required that the growth function $g$ be regular
[30], which means that it is Lipschitz continuous and that $g(x) \cdot x = 0$
for all $x \in \Delta$. The former condition guarantees that the system of
differential equations (6) has a unique solution through any initial population
state. The condition $g(x) \cdot x = 0$, instead, ensures that the simplex
$\Delta$ is invariant under (6), namely any trajectory starting in $\Delta$ will
remain in $\Delta$.

Payoff-monotonic game dynamics represent a wide class of regular selection
dynamics for which useful properties hold. Intuitively, for a payoff-monotonic
dynamics the strategies associated to higher payoffs will increase at a higher
rate. Formally, a regular selection dynamics (6) is said to be payoff-monotonic
if:

$$g_i(x) > g_j(x) \Leftrightarrow \pi_i(x) > \pi_j(x) \qquad (7)$$

for all $x \in \Delta$.

It is simple to show that all payoff-monotonic dynamics share the same set 
of stationary points. 

Proposition 1. A point $x \in \Delta$ is stationary under any payoff-monotonic
dynamics if and only if $\pi_i(x) = \pi(x)$ for all $i \in \sigma(x)$.

Proof. If $\pi_i(x) = \pi(x)$ for all $i \in \sigma(x)$, by monotonicity (7),
there exists $\mu \in \mathbb{R}$ such that $g_i(x) = \mu$ for all
$i \in \sigma(x)$. But, since $g$ is a regular growth-rate function,
$\sum_i x_i g_i(x) = 0$, and hence $\mu = 0$. Therefore, $x$ is a stationary
point for (6). On the other hand, if $x$ is stationary, then $g_i(x) = 0$ for all
$i \in \sigma(x)$. Hence, there exists a $\lambda \in \mathbb{R}$ such that
$\pi_i(x) = \lambda$ for all $i \in \sigma(x)$. But then
$\lambda = \sum_i x_i \pi_i(x) = \pi(x)$. □

In an unpublished paper [16], Hofbauer shows that the average population 
payoff is strictly increasing along the trajectories of any payoff-monotonic dy- 
namics. This result generalizes the celebrated fundamental theorem of natural 
selection [17,30]. Here, we provide a different proof adapted from [12]. 
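
Theorem 3. If the payoff matrix $A$ is symmetric, then $\pi(x) = x'Ax$ is
strictly increasing along any non-constant trajectory of any payoff-monotonic
dynamics. In other words, $\dot{\pi}(x(t)) \ge 0$ for all $t$, with equality if
and only if $x = x(t)$ is a stationary point.

Proof. For $x \in \Delta$, let:

$$\sigma_+(x) = \{ i \in \sigma(x) : g_i(x) \ge 0 \}$$

and

$$\sigma_-(x) = \{ i \in \sigma(x) : g_i(x) < 0 \} .$$

Clearly, $\sigma_+(x) \cup \sigma_-(x) = \sigma(x)$. Moreover, let:

$$\underline{\pi}(x) = \min\{ \pi_i(x) : i \in \sigma_+(x) \}$$

and

$$\overline{\pi}(x) = \max\{ \pi_i(x) : i \in \sigma_-(x) \} .$$

Because of payoff-monotonicity, note that
$\underline{\pi}(x) \ge \overline{\pi}(x)$. Note also that $\dot{x}_i \ge 0$ if
and only if $i \in \sigma_+(x)$.

Because of the symmetry of $A$, we have:

$$\begin{aligned}
\frac{\dot{\pi}(x)}{2} &= \sum_{i \in \sigma(x)} \dot{x}_i \pi_i(x) \\
&= \sum_{i \in \sigma_+(x)} \dot{x}_i \pi_i(x) + \sum_{i \in \sigma_-(x)} \dot{x}_i \pi_i(x) \\
&\ge \underline{\pi}(x) \sum_{i \in \sigma_+(x)} \dot{x}_i + \overline{\pi}(x) \sum_{i \in \sigma_-(x)} \dot{x}_i \\
&= \left( \underline{\pi}(x) - \overline{\pi}(x) \right) \sum_{i \in \sigma_+(x)} \dot{x}_i \\
&\ge 0
\end{aligned}$$

where the last equality follows from $\sum_i \dot{x}_i = 0$. Finally, note that
$\dot{\pi}(x) = 0$ if and only if $\pi_i(x)$ is constant for all
$i \in \sigma(x)$ which, from Proposition 1, amounts to saying that $x$ is
stationary. □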
A well-known subclass of payoff-monotonic game dynamics is given by:

$$\dot{x}_i = x_i \left( f(\pi_i(x)) - \sum_{j=1}^{n} x_j f(\pi_j(x)) \right) \qquad (8)$$

where $f(u)$ is an increasing function of $u$. These models arise in modeling
the evolution of behavior by way of imitation processes, where players are
occasionally given the opportunity to change their own strategies [16,30].

When $f$ is the identity function, i.e., $f(u) = u$, we obtain the standard
replicator equations:

$$\dot{x}_i = x_i \left( \pi_i(x) - \sum_{j=1}^{n} x_j \pi_j(x) \right) \qquad (9)$$

whose basic idea is that the average rate of increase $\dot{x}_i / x_i$ equals
the difference between the average fitness of strategy $i$ and the mean fitness
over the entire population.

Another popular model arises when $f(u) = e^{ku}$, which yields:

$$\dot{x}_i = x_i \left( e^{k \pi_i(x)} - \sum_{j=1}^{n} x_j e^{k \pi_j(x)} \right) \qquad (10)$$

where $k$ is a positive constant. As $k$ tends to 0, the orbits of this dynamics
approach those of the standard, first-order replicator model (9), slowed down
by the factor $k$; moreover, for large values of $k$ the model approximates the
so-called "best-reply" dynamics [16,17]. As it turns out [16], these models
behave essentially in the same way as the standard replicator equations (9), the
only significant difference being the size of the basins of attraction around
stable equilibria. From a computational perspective, exponential replicator
dynamics are particularly attractive as they may be considerably faster and even
more accurate than the standard, first-order model (see [25] and the results
reported below).
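
The behaviour stated in Theorem 3 can be observed numerically. The sketch below integrates the imitation dynamics (8) with an explicit Euler scheme for the two choices of $f$ discussed above and records the mean payoff along the trajectory. It is an illustration under stated assumptions: the Euler discretization, the step size `dt`, the value $k = 2$, the starting point, and the example payoff matrix are all choices made here, not part of the original method.

```python
import math


def payoffs(A, x):
    """pi_i(x) = (Ax)_i, as in equation (4)."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]


def mean_payoff(A, x):
    """pi(x) = x'Ax, as in equation (5)."""
    return sum(x_i * p_i for x_i, p_i in zip(x, payoffs(A, x)))


def euler_step(A, x, f, dt):
    """One explicit Euler step of x_i' = x_i (f(pi_i) - sum_j x_j f(pi_j))."""
    fx = [f(p) for p in payoffs(A, x)]
    avg = sum(x_i * f_i for x_i, f_i in zip(x, fx))
    return [x_i + dt * x_i * (f_i - avg) for x_i, f_i in zip(x, fx)]


def simulate(A, x0, f, dt=0.01, steps=2000):
    """Integrate from x0 and record the mean payoff after every step."""
    x, history = list(x0), [mean_payoff(A, x0)]
    for _ in range(steps):
        x = euler_step(A, x, f, dt)
        history.append(mean_payoff(A, x))
    return x, history


# Symmetric payoff matrix: adjacency matrix of a triangle {0, 1, 2} with a
# pendant vertex 3 attached to vertex 2, plus 1/2 on the diagonal.
A = [[0.5, 1, 1, 0],
     [1, 0.5, 1, 0],
     [1, 1, 0.5, 1],
     [0, 0, 1, 0.5]]
x0 = [0.4, 0.3, 0.2, 0.1]

x_lin, pay_lin = simulate(A, x0, f=lambda u: u)                  # model (9)
x_exp, pay_exp = simulate(A, x0, f=lambda u: math.exp(2.0 * u))  # model (10), k = 2
```

For a sufficiently small step size the recorded mean payoff is non-decreasing for both models, and the state remains on the simplex up to rounding error.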

In light of their dynamical properties, payoff-monotonic dynamics naturally
suggest themselves as simple heuristics for solving the maximal subtree
isomorphism problem. Let $T_1 = (V_1, E_1)$ and $T_2 = (V_2, E_2)$ be two free
trees, and let $A_G$ denote the adjacency matrix of their FTAG $G$. By putting

$$A = A_G + \frac{1}{2} I \qquad (11)$$

where $I$ is the identity matrix, we know from Theorem 3 that any
payoff-monotonic dynamics starting from an arbitrary initial state, will
iteratively maximize the function $f_G$ defined in (3) over the simplex and will
eventually converge with probability 1 to a strict local maximizer which, by
virtue of Theorem 2, will then correspond to the characteristic vector of a
maximal clique in the association graph. As stated in Theorem 1, this will in
turn induce a maximal subtree isomorphism between $T_1$ and $T_2$.

Clearly, in theory there is no guarantee that the converged solution will be
a global maximizer of $f_G$, and therefore that it will induce a maximum
isomorphism between the two original trees. Previous experimental work with
standard replicator dynamics [7,24,25,27] and also the results presented in the
next section, however, suggest that the basins of attraction of optimal or
near-optimal solutions are quite large, and very frequently the algorithm
converges to one of them, despite its inherent inability to escape from local
optima.

5 Experimental Results

In this section we present experiments applying payoff-monotonic dynamics
to the free tree matching problem. In our simulations, we used the following
discrete-time models:

$$x_i(t+1) = \frac{x_i(t)\, \pi_i(t)}{\sum_{j=1}^{n} x_j(t)\, \pi_j(t)} \qquad (12)$$

and

$$x_i(t+1) = \frac{x_i(t)\, e^{k \pi_i(t)}}{\sum_{j=1}^{n} x_j(t)\, e^{k \pi_j(t)}} \qquad (13)$$

which correspond to well-known discretizations of equations (9) and (10),
respectively [9,14,17,30]. Note that model (12) is the standard discrete-time
replicator dynamics, which have already proven to be remarkably effective in
tackling maximum clique and related problems, and to be competitive with other
more elaborate neural network heuristics [5,7,8,24,25,27]. Equation (13) has
been used in [25] to approximately solve the graph isomorphism problem. For the
latter dynamics the value $k = 10$ was used.

Both the first-order and the exponential processes were started from the
simplex barycenter and stopped when either a maximal clique (i.e., a local
maximizer of $f_G$) was found or the distance between two successive points was
smaller than a fixed threshold. In the latter case the converged vector was
randomly perturbed, and the algorithms restarted from the perturbed point.
Because of the one-to-one correspondence between local maximizers and maximal
cliques, this situation corresponds to convergence to a saddle point.
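
The two discrete-time models can be sketched as follows. This is a simplified illustration rather than the code used in the experiments: it stops only on the distance between successive points (the maximal-clique test and the random perturbation step are omitted), and the names `replicator_step`, `run`, and `support`, the tolerance values, and the four-node example graph are assumptions.

```python
import math


def replicator_step(A, x, k=None):
    """One iteration of model (12) if k is None, otherwise of model (13)."""
    pi = [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]
    if k is None:
        weights = [x_i * p_i for x_i, p_i in zip(x, pi)]
    else:
        weights = [x_i * math.exp(k * p_i) for x_i, p_i in zip(x, pi)]
    total = sum(weights)
    return [w / total for w in weights]


def run(A, k=None, tol=1e-10, max_iter=10000):
    """Iterate from the simplex barycenter until successive points are close."""
    n = len(A)
    x = [1.0 / n] * n
    for _ in range(max_iter):
        x_new = replicator_step(A, x, k)
        if max(abs(a - b) for a, b in zip(x, x_new)) < tol:
            return x_new
        x = x_new
    return x


def support(x, eps=1e-6):
    """Indices whose mass has not (numerically) vanished."""
    return {i for i, x_i in enumerate(x) if x_i > eps}


# A = A_G + I/2 for a triangle {0, 1, 2} with a pendant vertex 3 attached to 2.
A = [[0.5, 1, 1, 0],
     [1, 0.5, 1, 0],
     [1, 1, 0.5, 1],
     [0, 0, 1, 0.5]]

x_linear = run(A)             # first-order model (12)
x_exponential = run(A, k=10)  # exponential model (13)
```

On this small example both processes converge to the characteristic vector of the triangle, whose support is the maximum clique of the graph.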



5.1 Matching Shape-Axis Trees 

Recently, Liu et al. [19] introduced a new representation for shape based on the
idea of self-similarity. Intuitively, given a closed planar shape, they consider
two different parameterizations of its contour, namely one oriented
counterclockwise, $\Gamma(s) = \{ x(s) : 0 \le s \le 1 \}$, and the other
clockwise, $\hat{\Gamma}(t) = \{ \hat{x}(t) = x(1 - t) : 0 \le t \le 1 \}$. By
minimizing an appropriate cost functional they find a "good" match between
$\Gamma$ and $\hat{\Gamma}$, and then define the shape axis (SA) as the locus of
middle points between the matched contour points. From a given SA, it is
possible to construct a unique free tree, called the SA-tree, by grouping the
discontinuities contained in the SA. In Figure 1 the SA-tree construction
process is illustrated, and Figure 2 shows the SA-trees derived from a few
example shapes.

The proposed matching algorithms were tested on a selection of 17 shapes
(SA-trees) representing six different object classes (horse, human, bird, dog,
sheep, and rhino). We matched each shape against each other (and itself) and
in all the 289 trials both algorithms returned the maximum isomorphism, i.e.
a maximum clique in the FTAG. Figure 3 shows a few example matches. This
is a remarkable fact, considering that replicator dynamics are unable to escape
from local solutions. Similar findings on related problems are discussed in
[25,27]. As far as the computational time is concerned, both dynamics took only
a few seconds to converge on a 350 MHz AMD K6-2 processor, the exponential one
being slightly faster than the linear one (but see below for a rather different
picture).

Fig. 1. Illustration of the SA-tree construction. Three shapes (left), their
shape-model (middle), and the corresponding SA-trees (right).

Fig. 2. Examples of SA-trees, under various shape deformations.

Fig. 3. Some examples of matching SA-trees.



5.2 Matching Larger Trees 

Encouraged by the results reported above, we proceeded by testing our algo- 
rithms over much larger (random) trees. Random structures represent a useful 
benchmark not only because they are not constrained to any particular applica- 
tion, but also because it is simple to replicate experiments and hence to make 
comparisons with other algorithms. 

In this series of experiments, the following protocol was used. A hundred 100- 
node free trees were generated uniformly at random using a procedure described 
by Wilf in [31]. Then, each such tree was subject to a corruption process which 
consisted of randomly deleting a fraction of its nodes (in fact, the to-be-deleted 
nodes were constrained to be the terminal ones, otherwise the resulting graph 
would have been disconnected) , thereby obtaining a tree isomorphic to a proper 
subtree of the original one. Various levels of corruption (i.e., percentage of node 
deletion) were used, namely 2%, 10%, 20%, 30% and 40%. This means that the 
order of the pruned trees ranged from 98 to 60. Overall, therefore, 500 pairs of 
trees were obtained, for each of which the corresponding FTAG was constructed 
as described in Section 2. To keep the order of the association graph as low as
possible, its vertex set was constructed as follows:

$$V = \{ (u, w) \in V' \times V'' : \deg(u) \le \deg(w) \} ,$$

assuming $|V'| \le |V''|$, the edge set $E$ being defined as in (2). It is
straightforward to see that when the first tree is isomorphic to a subtree of
the second, Theorem 1 continues to hold. This simple heuristic may significantly
reduce the dimensionality of the search space. We also performed some
experiments with unpruned FTAGs but no significant difference in performance was
noticed apart from, of course, heavier memory requirements.

Fig. 4. Results obtained over 100-node random trees with various levels of
corruption, using the first-order dynamics (12). Top: Percentage of correct
matches. Bottom: Average computational time taken by the replicator equations.
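
The corruption process and the degree-based pruning of the association graph described above can be sketched as follows. This is a hedged illustration: removing leaves one at a time is one possible reading of the deletion protocol, and the function names `delete_leaves` and `pruned_vertex_set`, the seeded random generator, and the seven-node example tree are assumptions introduced here.

```python
import random


def delete_leaves(tree, fraction, rng):
    """Remove round(fraction * n) nodes from `tree`, one leaf at a time.

    `tree` maps each node to its neighbours. Deleting only terminal nodes
    keeps the result connected, so it is again a tree.
    """
    adj = {u: set(nbrs) for u, nbrs in tree.items()}
    for _ in range(round(fraction * len(adj))):
        leaves = sorted(u for u, nbrs in adj.items() if len(nbrs) == 1)
        victim = rng.choice(leaves)
        (neighbour,) = adj.pop(victim)
        adj[neighbour].discard(victim)
    return adj


def pruned_vertex_set(small, large):
    """Pairs (u, w) with deg(u) <= deg(w), u in the smaller tree."""
    return [(u, w) for u in small for w in large
            if len(small[u]) <= len(large[w])]


# A small 7-node tree standing in for the 100-node random trees.
original = {0: [1, 2], 1: [0, 3, 4], 2: [0, 5], 3: [1], 4: [1],
            5: [2, 6], 6: [5]}
pruned = delete_leaves(original, 0.3, random.Random(0))
pairs = set(pruned_vertex_set(pruned, original))
```

Because deleting nodes can only lower degrees, every pair `(u, u)` survives the pruning, so the correct match remains representable in the reduced vertex set.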

As in the previous series of experiments, both the linear and the exponential
dynamics were used, with identical parameters and stopping criterion. After
convergence, we calculated the proportion of matched nodes, i.e., the ratio
between the cardinality of the clique found and the order of the smaller
subtree, and then we averaged. Figure 4 (top) shows the results obtained using
the linear dynamics (12) as a function of the corruption level. As can be seen,
the algorithm was always able to find a correct maximum isomorphism, i.e. a
maximum clique in the FTAG. Figure 4 (bottom) plots the corresponding (average)
CPU time taken by the processes, with corresponding error bars (simulations were
performed on the same machine used for the shape-axis experiments).

Fig. 5. Results obtained over 100-node random trees with various levels of
corruption, using the exponential dynamics (13). Top: Percentage of correct
matches. Bottom: Average computational time taken by the replicator equations.

In Figure 5, the results pertaining to the exponential dynamics (13) are
shown. In terms of solution quality the algorithm performed exactly as its
linear counterpart, but this time it was dramatically faster. This confirms
earlier results reported in [25].



6 Conclusions 



We have developed a formal approach for matching connected and acyclic
relational structures, i.e. free trees, by constructing an association graph
whose maximal cliques are in one-to-one correspondence with maximal subtree
isomorphisms. The framework is general and can be applied in a variety of
computer vision domains: we have demonstrated its potential for shape matching.
The solution is found by using payoff-monotonic dynamical systems, which makes
the approach amenable to hardware implementation and offers the advantage of
biological plausibility. In particular, these relaxation labeling equations are
related to putative neuronal implementations [21,22]. Extensive experiments on
hundreds of uniformly random trees have also been conducted and, as in previous
work on graph isomorphism [25] and rooted tree matching [27], the results are
impressive. Despite the counter-intuitive maximum clique formulation of the tree
matching problem, and the inherent inability of these simple dynamics to escape
from local optima, they nevertheless were always able to find a globally optimal
solution.

Before concluding, we note that the framework can easily be extended to
tackle the problem of matching attributed (free) trees. In this case, the
attributes result in weights being placed on the nodes of the association graph,
and a conversion of the maximum clique problem to a maximum weight clique
problem [27,8]. Note also that, since the presented approach does not allow for
many-to-many correspondences, it cannot be compared as is with edit-distance
tree matching algorithms, such as the one presented in [18]. However, it is
straightforward to formulate many-to-one or many-to-many versions of our
framework along the lines suggested in [3,28] for rooted attributed trees. This
will be the subject of future investigations.

Abstract. This paper investigates an approach to the tree edit distance
problem with weighted nodes. We show that any tree obtained with
a sequence of cut and relabel operations is a subtree of the transitive
closure of the original tree. Furthermore, we show that the necessary
condition for any subtree to be a solution can be reduced to a clique
problem in a derived structure. Using this idea we transform the tree
edit distance problem into a series of maximum weight clique problems
and then we use relaxation labeling to find an approximate solution.



1 Introduction 

The problem of how to measure the similarity of pictorial information which has 
been abstracted using graph-structures has been the focus of sustained research 
activity for over twenty years in the computer vision literature. Moreover, the 
problem has recently acquired significant topicality with the need to develop 
ways of retrieving images from large data-bases. Stated succinctly, the problem is 
one of inexact or error-tolerant graph-matching. Early work on the topic included 
Barrow and Burstall’s idea [1] of locating matches by searching for maximum 
common subgraphs using the association graph, and the extension of the concept 
of string edit distance to graph-matching by Fu and his co-workers [6]. The
idea behind edit distance [18] is that it is possible to identify a set of basic 
edit operations on nodes and edges of a structure, and to associate with these 
operations a cost. The edit-distance is found by searching for the sequence of 
edit operations that will make the two graphs isomorphic with one-another and 
which has minimum cost. By making the evaluation of structural modification 
explicit, edit distance provides a very effective way of measuring the similarity of 
relational structures. Moreover, the method has considerable potential for error 
tolerant object recognition and indexing problems. 

Unfortunately, the task of calculating edit distance is a computationally hard 
problem and most early efforts can be regarded as being goal-directed. However, 
in an important series of recent papers, Bunke has demonstrated the intimate 
relationship between the size of the maximum common subgraph and the edit 
distance [4] . In particular, he showed that, under certain assumptions concerning 
the edit-costs, computing the MCS and the graph edit distance are equivalent. 
The restriction imposed on the edit-costs is that the deletions and re-insertions 
of nodes and edges are not more expensive than the corresponding node or
edge relabeling operations. In other words, there is no incentive to use
relabeling operations, and as a result the edit operations can be reduced to
those of insertion and deletion.

The work reported in this paper builds on a simple observation which follows
from Bunke's work. By re-casting the search for the maximum common subgraph
as a max clique problem [1], we can efficiently compute the edit distance.
A diverse array of powerful heuristics and theoretical results is available for
solving the max clique problem. In particular the Motzkin-Straus theorem [10]
allows us to transform the max clique problem into a continuous quadratic
programming problem. An important recent development is reported by Pelillo [11]
who shows how probabilistic relaxation labeling can be used to find a (local)
optimum of this quadratic programming problem.

In this paper we are interested in measuring the similarity of tree structures
obtained from a skeletal representation of 2D shape. While trees are a special
case of graphs, because of the connectivity and partial order constraints which
apply to them, the methods used to compare and match them require significant
specific adaptation. For instance, Bartoli et al. [2] use the graph theoretic
notion of a path string to transform the tree isomorphism problem into a single
max weighted clique problem. This work uses a refinement of the Motzkin-Straus
theorem to transform the max weighted clique problem into a quadratic
programming problem on the simplex [3]; the quadratic problem is then solved
using relaxation labeling.

Because of the added connectivity and partial order constraints mentioned 
above, Bunke’s result linking the computation of edit distance to the size of the 
maximum common subgraph does not translate in a simple way to trees. Fur- 
thermore, specific characteristics of trees suggest that posing the tree-matching 
problem as a variant on graph-matching is not the best approach. In particular, 
both the tree isomorphism and the subtree isomorphism problems have efficient 
polynomial time solutions. Moreover, Tai [16] has proposed a generalization of 
the string edit distance problem from the linear structure of a string to the non- 
linear structure of a tree. The resulting tree edit distance differs from the general 
graph edit distance in that edit operations are carried out only on nodes and 
never directly on edges. The edit operations thus defined are node deletion, node 
insertion and node relabeling. This simplified set of edit operations is guaranteed 
to preserve the connectivity of the tree structure. Zhang and Shasha [22] have 
investigated a special case which involves adding the constraint that the solu- 
tion must maintain the order of the children of a node. With this order among 
siblings, they showed that the tree-matching problem is still in P and gave an
algorithm to solve it. In subsequent work they showed that the unordered case was
indeed an NP-hard problem [23]. The NP-completeness, though, can be eliminated
again by adding particular constraints to the edit operations. In particular,
it can be shown that the problem returns to P when we add the constraint of 
strict hierarchy, that is when separate subtrees are constrained to be mapped to 
separate subtrees [21]. 

In this paper we propose an energy minimization method for efficiently
computing the weighted tree edit distance. We follow Pelillo [11] by casting the
problem into the Motzkin-Straus framework. To achieve this goal we use the
graph-theoretic notion of tree closure. We show that, given a tree T, any
tree obtained by cutting nodes from T is a subtree of the closure of T.
Furthermore, we can eliminate subtrees that cannot be obtained from T by solving
a series of max-clique problems. In this way we provide a divide and conquer
method for finding the maximum edited common subtree by searching for maximal
cliques of an association graph formed from the closure of the two trees.
With this representation to hand, we follow Bomze et al. [3] and use a variant
of the Motzkin-Straus theorem to convert the maximum weighted clique problem
into a quadratic programming problem which can be solved by relaxation
labeling.

2 Exact Tree Matching 

In this section we describe a polynomial time algorithm for the subtree isomor- 
phism problem. This allows us to formalize some concepts and give a starting 
point to extend the approach to the minimum tree edit distance problem. 



2.1 Association Graph 

The phase space we use to represent the matching of nodes is the directed
association graph, a variant of the association graph. The association graph is
a structure that is frequently used in graph matching problems. The node set of
the association graph is the Cartesian product of the node sets of the graphs to
be matched. Hence, each node represents a possible association, or match, of a
node in one graph to a node in the other. The edges of the association graph
represent the pairwise constraints of the problem: they represent both
connectivity on the original graphs and the feasibility of a solution with the
linked associations.

Hierarchical graphs have an order relation induced by paths: given two nodes
$a$ and $b$, $(a, b)$ is in this relation if and only if there is a path from
$a$ to $b$. When the directed graph is acyclic, this relation can be shown to be
an (irreflexive) order relation. The use of directed arcs in the association
graph allows us to make use of this order. We connect nodes with directed arcs
in a way that preserves the ordering of the associated graph. The graph obtained
can be shown to be ordered still. Specifically, an association graph for the
tree isomorphism problem can be shown to be a forest.
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For the exact isomorphism problem (maximum common subgraph) the edges 
of the association graphs are: 

(u, v') — > (u, u) iff u — u and v' u (1) 

where v and u are nodes on one graph and v' and u' are nodes on the other. 

Proposition 1. The directed association graph of two directed acyclic graphs 
(DAGs) G and G' is acyclic. 

Proof. Let us assume that $(u_1, v_1) \to \cdots \to (u_n, v_n)$ is a cycle. Then, since an
arc $(v, v') \to (u, u')$ in the association graph exists only if the arcs $v \to u$ and
$v' \to u'$ exist in G and G' respectively, we have that $u_1 \to \cdots \to u_n$ is a cycle
in G and $v_1 \to \cdots \to v_n$ is a cycle in G', against the hypothesis that they are
DAGs.



Proposition 2. The directed association graph of two trees t and t' is a forest. 

Proof. We already know that the association graph is a DAG; we have to show
that for each node (u, u') there is at most one node (v, v') such that
$(v, v') \to (u, u')$. Due to the way the association graph is constructed, every
incoming arc of (u, u') pairs an incoming edge of u with an incoming edge of u',
so it suffices that u and u' each have at most one incoming edge. But t and t'
are trees, so both u and u' have at most one incoming edge, namely the one from
the parent.
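
The construction of equation (1) and the forest property of Proposition 2 can be checked on a toy example. The following is a minimal Python sketch, assuming each tree is given as a list of (parent, child) arcs; the function names and the example trees are hypothetical and introduced only for illustration.

```python
from itertools import product

def directed_association_graph(edges1, edges2):
    """Arcs of the directed association graph of two directed graphs.

    edges1, edges2: iterables of (parent, child) arcs.
    An arc (v, v') -> (u, u') is created iff v -> u and v' -> u'.
    """
    return {((v, vp), (u, up)) for (v, u), (vp, up) in product(edges1, edges2)}

def in_degrees(arcs):
    """Number of incoming arcs for every association node that has any."""
    counts = {}
    for _, target in arcs:
        counts[target] = counts.get(target, 0) + 1
    return counts

# Two small trees given as parent -> child arcs (hypothetical example data).
tree1 = [("a", "b"), ("a", "c"), ("b", "d")]
tree2 = [("x", "y"), ("y", "z")]

arcs = directed_association_graph(tree1, tree2)
# Forest property: no association node has more than one parent.
is_forest = all(count <= 1 for count in in_degrees(arcs).values())
```

Each pair of tree arcs yields exactly one association arc, so the graph above has one arc per element of the product of the two arc lists.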

The directed association graph can be used to reduce a tree matching problem
into subproblems: the best match given the association of nodes v and v' can be
found by examining only the descendants of v and v'. This gives us a divide and
conquer solution to the maximum common subtree problem: we use the association
graph to divide the problem and transform it into maximum bipartite match
subproblems, which can then be efficiently conquered with known polynomial
time algorithms. We then extend the approach to the minimum unlabeled tree
edit problem and present an evolutionary method to conquer the subproblems.
Finally, we present a method to convert the divide and conquer approach into a
multi-population evolutionary approach.



2.2 Maximum Common Subtree 

We present a divide and conquer approach to the exact maximum common subtree
problem. We call the maximum common subtree rooted at (v, v') a solution
to the maximum common subtree problem applied to two subtrees of t and t'.
In particular, the solution is on the subtrees of t and t' rooted at v and v'
respectively. This solution is further constrained with the condition that v and v'
are roots of the matched subtrees.

With the solution of the maximum rooted isomorphism problem for each child
of (v, v') at hand, the maximum isomorphism rooted at (v, v') can be reduced to a
maximum bipartite match problem. The two partitions V and V' of the bipartite
match consist of the children of v and v' respectively. The weight of the match
between $u \in V$ and $u' \in V'$ is the sum of the matched weights of the maximum
isomorphism rooted at (u, u'). In the case of an unweighted tree this is the
cardinality of the isomorphism. With this structure we have a one-to-one
relationship between matches in the bipartite graph and the children of (v, v') in
the association graph. The solution of the bipartite matching problem identifies
a set of children of (v, v') that satisfy the constraint of matching one node of t
to no more than one node of t'. Furthermore, among such sets it is the one that
guarantees the maximum total weight of the isomorphism rooted at (v, v').

The maximum isomorphism between t and t' is a maximum isomorphism
rooted at (v, v'), where either v or v' is the root of t or t' respectively. This
reduces the isomorphism problem to n + m rooted isomorphism problems, where
n and m are the cardinalities of t and t'. Furthermore, since there are nm nodes in
the association graph, the problem is reduced to nm maximum bipartite match
problems.
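
The recursion described in this section can be sketched as follows for unweighted trees. This is a minimal Python sketch under stated assumptions: trees are given as dictionaries mapping a node to a tuple of its children, and the bipartite matching step is solved by exhaustive search (a stand-in for a polynomial time matching algorithm, adequate only for small child sets). The names `best_matching` and `max_common_subtree` are hypothetical.

```python
from functools import lru_cache

def best_matching(weights):
    """Maximum-weight bipartite matching by exhaustive search.

    weights[i][j] is the gain of matching row i with column j; rows and
    columns may stay unmatched. Exponential in the number of columns.
    """
    n_rows = len(weights)
    n_cols = len(weights[0]) if weights else 0

    @lru_cache(maxsize=None)
    def solve(row, used):
        if row == n_rows:
            return 0
        best = solve(row + 1, used)  # leave this row unmatched
        for col in range(n_cols):
            if not used & (1 << col):
                gain = weights[row][col] + solve(row + 1, used | (1 << col))
                best = max(best, gain)
        return best

    return solve(0, 0)

def max_common_subtree(children1, root1, children2, root2):
    """Number of nodes in the maximum common subtree of two rooted trees."""
    @lru_cache(maxsize=None)
    def rooted(v, w):
        # Match v with w, then match their children with a bipartite matching
        # whose weights are the sizes of the rooted sub-solutions.
        kids_v = children1.get(v, ())
        kids_w = children2.get(w, ())
        weights = [[rooted(a, b) for b in kids_w] for a in kids_v]
        return 1 + best_matching(weights)

    nodes1 = {root1} | {c for kids in children1.values() for c in kids}
    nodes2 = {root2} | {c for kids in children2.values() for c in kids}
    # Only pairs in which one of the two nodes is a root need to be tried.
    candidates = [(root1, w) for w in nodes2] + [(v, root2) for v in nodes1]
    return max(rooted(v, w) for v, w in candidates)

# Hypothetical example trees.
t1 = {0: (1, 2), 1: (3,)}
t2 = {"a": ("b",), "b": ("c", "d")}
size = max_common_subtree(t1, 0, t2, "a")
```

Memoizing `rooted` mirrors the observation that each of the nm association nodes gives rise to one bipartite matching subproblem.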

3 Inexact Tree Matching 

We want to extend the algorithm to provide us with an error-tolerant tree
isomorphism. There is a strong connection between the computation of the maximum
common subtree and the tree edit distance. In [4] Bunke showed that, under
certain constraints applied to the edit-cost function, computing the maximum
common subgraph and computing the minimum graph edit distance are equivalent.

This is not directly true for trees, because of the added constraint that a 
tree must be connected. But, extending the concept to the common edited sub- 
tree, we can use common substructures to find the minimum cost edited tree 
isomorphism. In particular, we want to match weighted trees. These are trees 
with weight associated to the nodes and with the property that the cost of an 
edit operation is a function of the weight of the nodes involved. 

Following common use, we consider three fundamental operations:

– node removal: this operation removes a node and links the children to the
  parent of said node.
– node insertion: the dual of node removal.
– node relabel: this operation changes the weight of a node.

In our model the cost of node removal and insertion is equal to the weight of the
node, while the cost of relabeling a node is equal to the difference in the weights.
This approach identifies node removal with relabeling to weight 0 and is a natural
interpretation when the weight represents the “importance” of the node.

Since a node insertion on the data tree is dual to a node removal on the 
model tree, we can reduce the number of operations to be performed to only 
node removal, as long as we perform the operations on both trees. 

At this point we introduce the concept of edited isomorphism. Assuming that
we have two trees $T_1$ and $T_2$ and a tree T' that can be obtained from both with
node removal and relabel operations, T' will induce an isomorphism between
nodes in $T_1$ and $T_2$ that places two nodes in correspondence if and only if
they get cut down to the same node in T'. We call such an isomorphism an edited
isomorphism induced by T'. From the definition it is clear that there is a tree T',
obtained only with node removal and relabel operations, so that the sum of the
edit distances from this tree to $T_1$ and $T_2$ is equal to the edit distance between
$T_1$ and $T_2$, i.e. a median tree. We say that the isomorphism induced by this tree is
a maximum edited isomorphism because it maximizes $W_m = \sum_{(i,j)} \min(w_i, w_j)$,
where i and j are nodes matched by the isomorphism, and $w_i$ and $w_j$ are their
weights. In fact, if we let $W_1$ and $W_2$ be the total weights of $T_1$ and $T_2$
respectively, the edit distance between $T_1$ and T' is $W_1 - W_m$, and the distance
between $T_1$ and $T_2$ is $W_1 + W_2 - 2W_m$. Clearly, finding the maximum edited
isomorphism is equivalent to solving the tree edit distance problem.
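
The relation between matched weight and edit distance can be made concrete with a small computation. The sketch below is illustrative only: it assumes node weights are stored in dictionaries and that the edited isomorphism is given as a list of matched pairs; the function name and the example data are hypothetical.

```python
def edit_distance_from_match(weights1, weights2, matches):
    """Edit distance W1 + W2 - 2 * Wm implied by an edited isomorphism.

    weights1, weights2: dict node -> weight for the two trees.
    matches: iterable of (i, j) pairs placed in correspondence.
    """
    w1 = sum(weights1.values())
    w2 = sum(weights2.values())
    wm = sum(min(weights1[i], weights2[j]) for i, j in matches)
    return w1 + w2 - 2 * wm

# Hypothetical example: node c is removed and node a is relabeled from 3 to 2.
weights1 = {"a": 3.0, "b": 2.0, "c": 1.0}
weights2 = {"x": 2.0, "y": 2.0}
distance = edit_distance_from_match(weights1, weights2, [("a", "x"), ("b", "y")])
```

Here the removal of c costs 1 and the relabeling of a costs 1, which agrees with the value returned by the formula.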



3.1 Editing the Transitive Closure of a Tree

For each node v of t, we can define an edit operation $E_v$ on the tree and an edit
operation $\mathcal{E}_v$ on the closure Ct of the tree t (see Figure 1). In both cases the
edit operation removes the node v, all the incoming edges, and all the outgoing
edges; on the tree, the children of v are then linked to the parent of v, as in the
node removal operation defined above.

We show that the transitive closure operation and the node removal operation
commute, that is we have:

Lemma 1. $\mathcal{E}_v(C(t)) = C(E_v(t))$

Proof. If a node is in $\mathcal{E}_v(C(t))$ it is clearly also in $C(E_v(t))$. What is left to
show is that an edge (a, b) is in $\mathcal{E}_v(C(t))$ if and only if it is in $C(E_v(t))$.

If (a, b) is in $C(E_v(t))$ then neither a nor b is v and there is a path from a to b
in $E_v(t)$. Since the edit operation $E_v$ preserves connectedness and the hierarchy,
there must be a path from a to b in t as well. This implies that (a, b) is in C(t).
Since neither a nor b is v, the operation $\mathcal{E}_v$ will not delete (a, b). Thus (a, b) is
in $\mathcal{E}_v(C(t))$.

If (a, b) is in $\mathcal{E}_v(C(t))$, then it is also in C(t), because $\mathcal{E}_v(C(t))$ is obtained
from C(t) by simply removing a node and some edges. This implies that there
is a path from a to b in t and, as long as neither a nor b is v, there is a path
from a to b in $E_v(t)$ as well. Since (a, b) is in $\mathcal{E}_v(C(t))$, both a and b must be
nodes in $\mathcal{E}_v(C(t))$ and, thus, neither can be v. Thus (a, b) is in $C(E_v(t))$.

Furthermore, the transitive closure operation clearly commutes with node 
relabeling as well, since one acts only on weights and the other acts only on 
node connectivity. 
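
The commutation property of Lemma 1 can be verified numerically on a small tree. The following Python sketch assumes a tree is stored as a dictionary mapping each node to its parent (None for the root) and that the removed node is not the root; all names and the example tree are hypothetical.

```python
def transitive_closure(parent):
    """Edges (a, b) such that a is a proper ancestor of b in the tree."""
    edges = set()
    for node in parent:
        ancestor = parent[node]
        while ancestor is not None:
            edges.add((ancestor, node))
            ancestor = parent[ancestor]
    return edges

def remove_from_tree(parent, v):
    """Tree edit: delete v and link its children to the parent of v."""
    return {n: (parent[v] if p == v else p) for n, p in parent.items() if n != v}

def remove_from_closure(edges, v):
    """Closure edit: delete v with all its incoming and outgoing edges."""
    return {(a, b) for a, b in edges if v not in (a, b)}

# Hypothetical example tree: r -> a -> {b -> d, c}.
tree = {"r": None, "a": "r", "b": "a", "c": "a", "d": "b"}
closure_then_remove = remove_from_closure(transitive_closure(tree), "a")
remove_then_closure = transitive_closure(remove_from_tree(tree, "a"))
```

Both orders of application produce the same edge set, as the lemma states.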

We call a subtree s of Ct consistent if for each node v of s there cannot be
two children a and b so that (a, b) is in Ct. In other words, given two nodes a
and b, siblings in s, s is consistent if and only if there is no path from a to b in t.
We can, now, prove the following:

Theorem 1. A tree $\hat{t}$ can be obtained from a tree t with an edit sequence
composed of only node removal and node relabeling operations if and only if $\hat{t}$ is a
consistent subtree of the DAG Ct.

Proof. Let us assume that there is an edit sequence $\{E_{v_i}\}$ that transforms t into
$\hat{t}$; then, by virtue of the above lemma, the dual edit sequence $\{\mathcal{E}_{v_i}\}$ transforms
Ct into $C\hat{t}$. By construction we have that $\hat{t}$ is a subtree of $C\hat{t}$ and $C\hat{t}$ is a
subgraph of Ct, thus $\hat{t}$ is a subtree of Ct. Furthermore, since the node removal
operations respect the hierarchy, $\hat{t}$ is a consistent subtree of Ct.

To prove the converse, assume that $\hat{t}$ is a consistent subtree of Ct. If (a, b) is
an edge of $\hat{t}$, then it is an edge of Ct as well, i.e. there is a path from a to b in
t and we can define a sequence of edit operations $\{E_{v_i}\}$ that removes any node
between a and b in such a path. Showing that the nodes $\{v_i\}$ deleted by the edit
sequence cannot be in $\hat{t}$, we show that all the edit operations defined this way
are orthogonal. As a result they can be combined to form a single edit sequence
that solves the problem.

Let v in $\hat{t}$ be a node in the edited path and let p be the minimum common
ancestor of v and a in $\hat{t}$. Furthermore, let w be the only child of p in $\hat{t}$ that is an
ancestor of v in $\hat{t}$ and let q be the only child of p in $\hat{t}$ that is an ancestor of a in
$\hat{t}$. Since a is an ancestor of v in t, an ancestor of v can be a descendant of a, an
ancestor of a, or a itself. This means that w has to be in the edited path. Were
it not so, then w would have to be a or an ancestor of a, against the hypothesis
that p is the minimum common ancestor of v and a. Since q is an ancestor of a
in t and a is an ancestor of w in t, q is an ancestor of w in t, but q and w are
siblings in $\hat{t}$, against the hypothesis that $\hat{t}$ is consistent.

Using this result, we can show that the minimum cost edited tree isomor- 
phism between two trees t and t' is a maximum common consistent subtree of 
the two DAGs Ct and Ct', provided that the cost of node removal and node 
matching depends only on the weights. 

The minimum cost edited tree isomorphism is a tree that can be obtained 
from both model tree t and data tree t' with node removal and relabel operations. 
By virtue of the theorem above, this tree is a consistent subtree of both Ct and 
Ct' . The tree must be obtained with minimum combined edit cost. Since node 
removal can be considered as matching to a node with 0 weight, the isomorphism 
that grants the minimum combined edit cost is the one that gives the maximum 
combined match, i.e. it must be the maximum common consistent subtree of the 
two DAGs. 



3.2 Cliques and Common Consistent Subtrees 

In this section we show that the directed association graph induces a divide 
and conquer approach to edited tree matching as well. Given two trees t and 
t' to be matched, we create the directed association graph of the transitive 
closures Ct and Ct' and we look for a consistent matching tree in the graph. 
That is we seek a tree in the graph that corresponds to two consistent trees in 
the transitive closures Ct and Ct' . The maximum such tree corresponds to the 
maximum common consistent subtree of Ct and Ct' . 

In analogy to what we did for the exact matching case, we divide the problem
into a maximum common consistent subtree rooted at (v, w), for each node
(v, w) of the association graph. We show that, given the weight of the maximum
common consistent subtree rooted at each child of (v, w) in the association
graph, we can transform the rooted maximum common consistent subtree
problem into a max weighted clique problem. Solving this problem for each node
in the association graph and looking for the maximum weight rooted common
consistent subtree, we can find the solution to the minimum cost edited tree
isomorphism problem.

Let us assume that we know the weight of the isomorphism for every child 
of (v, w) in the association graph. We want to find the consistent set of siblings 
with greatest total weight. Let us construct an undirected graph whose nodes 
consist of the children of (v, w) in the association graph. We connect two nodes 
(p, q) and (r, s) if and only if there is no path connecting p and r in t and there 
is no path connecting q and s in t' . This means that we connect two matches 
(p, q) and (r, s) if and only if they match nodes that are consistent siblings in 
each tree. Furthermore, we assign to each association node (a, b) a weight equal
to the weight of the maximum common consistent subtree rooted at (a, b). The
maximum weight clique of this graph will be the set of consistent siblings with
maximum total weight. The weight of the maximum common consistent subtree
rooted at (v, w) will be this weight plus the minimum of the weights of v and
w, i.e. the maximum weight that can be obtained by the match. Furthermore,
the nodes of the clique will be the children of (v, w) in the maximum common
consistent subtree.
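
The clique subproblem for one association node can be illustrated with a brute-force search. The sketch below is a Python illustration under stated assumptions: trees are dictionaries mapping each node to its parent, the candidate associations and their rooted weights are supplied directly, two associations sharing a node are treated as inconsistent, and exhaustive enumeration stands in for the clique heuristic. All names and numbers are hypothetical.

```python
from itertools import combinations

def is_ancestor(parent, a, b):
    """True if there is a path from a down to b in the tree (a != b)."""
    node = parent[b]
    while node is not None:
        if node == a:
            return True
        node = parent[node]
    return False

def consistent(parent1, parent2, m1, m2):
    """Two associations are linked iff their nodes are unrelated in both trees."""
    (p, q), (r, s) = m1, m2
    related1 = p == r or is_ancestor(parent1, p, r) or is_ancestor(parent1, r, p)
    related2 = q == s or is_ancestor(parent2, q, s) or is_ancestor(parent2, s, q)
    return not related1 and not related2

def max_weight_consistent_set(parent1, parent2, candidates):
    """Brute-force maximum weight clique among candidate associations."""
    best, best_weight = (), 0.0
    nodes = list(candidates)
    for size in range(1, len(nodes) + 1):
        for subset in combinations(nodes, size):
            pairs = combinations(subset, 2)
            if all(consistent(parent1, parent2, a, b) for a, b in pairs):
                weight = sum(candidates[m] for m in subset)
                if weight > best_weight:
                    best, best_weight = subset, weight
    return set(best), best_weight

# Hypothetical subproblem rooted at (v, w).
parent1 = {"v": None, "a": "v", "b": "a", "c": "v"}
parent2 = {"w": None, "x": "w", "y": "w"}
candidates = {("a", "x"): 2.0, ("b", "x"): 1.5, ("c", "y"): 1.0, ("c", "x"): 0.5}
clique, clique_weight = max_weight_consistent_set(parent1, parent2, candidates)
node_weight = {"v": 1.0, "w": 0.8}
rooted_weight = clique_weight + min(node_weight["v"], node_weight["w"])
```

The last line adds the weight obtained by matching the roots of the subproblem to the weight of the clique, as described above.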



3.3 Heuristics for the Maximum Weighted Clique 



As we have seen, we have transformed an inexact tree matching problem into a 
series of maximum weighted clique problems. That is, we transformed one NP- 
complete problem into multiple NP-complete problems. The reason behind this 
approach lies in the fact that the max clique problem is, on average, a relatively 
easy problem. Furthermore, since the seminal paper of Barrow and Burstall [1], it 
is a standard technique for structural matching and a large number of approaches 
and very powerful heuristics exist to solve it or approximate it. 

The approach we will adopt to solve each single instance of the max weight 
clique problem is an evolutionary one introduced by Bomze, Pelillo and Stix 
[3]. This approach is based on a continuous formulation of the combinatorial 
problem and transforms it into a symmetric quadratic programming problem in 
the simplex $\Delta$. For more detail we refer to the appendix.

Relaxation labeling is an evidence-combining process developed in the framework
of constraint satisfaction problems. Its goal is to find a classification p that
satisfies pairwise constraints and interactions between its elements. The process
is determined by the update rule

$$p_i^{t+1}(\lambda) = \frac{p_i^t(\lambda)\, q_i^t(\lambda)}{\sum_\mu p_i^t(\mu)\, q_i^t(\mu)} \qquad (2)$$

where the compatibility component is $q_i(\lambda) = \sum_j \sum_\mu r_{ij}(\lambda, \mu)\, p_j(\mu)$.

In [12] Pelillo showed that the function $A(p) = \sum_{i,\lambda} p_i(\lambda)\, q_i(\lambda)$ is a Lyapunov
function for the process, i.e. $A(p^{t+1}) \ge A(p^t)$, with equality if and only if $p^t$ is
stationary.
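
A minimal Python sketch of the update rule follows. It assumes label distributions stored as nested lists `p[i][lam]` and non-negative, symmetric compatibilities stored as `r[i][j][lam][mu]`; these data layouts, the function names and the toy compatibilities are assumptions made for illustration.

```python
def support(p, r, i):
    """q_i(lam) = sum over j and mu of r[i][j][lam][mu] * p[j][mu]."""
    return [
        sum(r[i][j][lam][mu] * p[j][mu]
            for j in range(len(p)) for mu in range(len(p[j])))
        for lam in range(len(p[i]))
    ]

def relaxation_step(p, r):
    """One application of the update rule to every object."""
    new_p = []
    for i in range(len(p)):
        q = support(p, r, i)
        total = sum(p_il * q_il for p_il, q_il in zip(p[i], q))
        new_p.append([p_il * q_il / total for p_il, q_il in zip(p[i], q)])
    return new_p

def average_consistency(p, r):
    """A(p) = sum over i and lam of p_i(lam) * q_i(lam)."""
    return sum(
        p_il * q_il
        for i in range(len(p))
        for p_il, q_il in zip(p[i], support(p, r, i))
    )

# Hypothetical toy problem: two objects, two labels, equal labels compatible.
same, different = 1.0, 0.2
block = [[same, different], [different, same]]
r = [[block, block], [block, block]]
p = [[0.6, 0.4], [0.5, 0.5]]

history = [average_consistency(p, r)]
for _ in range(50):
    p = relaxation_step(p, r)
    history.append(average_consistency(p, r))
```

In this toy run the recorded values of A(p) never decrease, and both objects settle on the first label, which had the slight initial advantage.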

3.4 Putting It All Together 

In the previous sections we proved that the maximum edited tree isomorphism
problem can be reduced to nm maximum weight clique problems and we have
given an iterative process that is guaranteed to find maximal weight cliques. In
this section we will show how to use these ideas to develop a practical algorithm.

A direct way is to use the relaxation labeling dynamics starting from the
leaves of the directed association graph and to propagate the result upwards in
the graph, using the weight of the extracted clique to initialize the compatibility
matrix of every parent association. For a subproblem rooted at (u, v)
the compatibility coefficients can be calculated knowing the weight M of every
isomorphism rooted at the descendants of u and v. Specifically, the compatibility
coefficients are initialized as $R^{(u,v)} = \gamma e e^T - C^{(u,v)}$, or, equivalently,
$r^{(u,v)}(a, a', b, b') = \gamma - c^{(u,v)}_{(a,a')(b,b')}$, where

$$c^{(u,v)}_{(a,a')(b,b')} = \begin{cases} \frac{1}{2 M_{(a,a')}} & \text{if } (a, a') = (b, b') \\ c^{(u,v)}_{(a,a')(a,a')} + c^{(u,v)}_{(b,b')(b,b')} & \text{if } (a, a') \text{ and } (b, b') \text{ are not consistent} \\ 0 & \text{otherwise.} \end{cases}$$

This approach imposes a sequentiality on an otherwise highly parallel algorithm.
An alternative can be obtained by transforming the problem into a single
multi-object labeling process. With this approach we set up a labeling problem
with one object per node in the association graph, and at each iteration we
update the label distribution for each object. We then update the compatibility
matrices according to the new weight estimate.

This multi-object approach uses the fact that the compatibility matrix for 
one rooted matching subproblem does not depend upon which nodes are matched 
below the root, but only on the number of matches. That is, to solve one sub- 
problem we need to know only the weight of the cliques rooted at the children, 
not the nodes that form the clique. 

Gibbons’ result [7] guarantees that the weight of the clique is equal to $\frac{1}{x^T B x}$,
where x is the characteristic vector of the clique and B is the weight matrix
defined in (3). This allows us to generate an estimate of the clique weight at each
iteration: given the current distribution of label probability p for the subproblem
rooted at (u, v), we estimate the number of nodes matched under (u, v) as $\frac{1}{p^T B p}$,
and thus we assign to (u, v) the weight $\frac{1}{p^T B p} + \min(w_u, w_v)$, that is
the weight of the maximum set of consistent descendants plus the weight
that can be obtained matching node u with node v.

We obtain a two step update rule: at each iteration we update the label
probability distribution according to equation (2), and then we use the updated
distributions to generate new compatibility coefficients according to the rule
$r^{(u,v)}(a, a', b, b') = \gamma - c^{(u,v)}_{(a,a')(b,b')}$, where

$$c^{(u,v)}_{(a,a')(b,b')} = \begin{cases} \frac{1}{2 M_{(a,a')}} & \text{if } (a, a') = (b, b') \\ c^{(u,v)}_{(a,a')(a,a')} + c^{(u,v)}_{(b,b')(b,b')} & \text{if } (a, a') \text{ and } (b, b') \text{ are not consistent} \\ 0 & \text{otherwise} \end{cases}$$

and $M_{(a,a')}$ is the current estimate of the weight assigned to (a, a').



Another possible variation to the algorithm can be obtained using different 
initial assignments for the label distribution of each subproblem. 

A common approach is to initialize the assignment with a uniform distribution
so that we have an initial assignment close to the barycenter of the simplex.
A problem with this approach is that the size of the basin of attraction
of one maximal clique grows with the number of nodes in the clique.

With our problem decomposition the wider cliques are the ones that map 
nodes at lower levels. As a result the solution will be biased towards matches 
that are very low on the graph, even if these matches require cutting a lot of 
nodes and are, thus, less likely to give an optimum solution. 

A way around this problem is to choose an initialization that assigns a higher
initial likelihood to matches that are higher up on the subtree. In our experiments
we decided to initialize the probability of the association (a, b) for the subproblem
rooted at (u, v) as a function $p_{(u,v)}(a, b)$ of $d_a$, $d_b$ and $\epsilon$, where $d_a$ is the
depth of a with respect to u, $d_b$ is the depth of b with respect to v, and $\epsilon$ is a
small perturbation. Of course, we then renormalize to ensure that it is still in
the simplex.

4 Experimental Results 

We evaluate the new tree-matching method on the problem of shock-tree match- 
ing. The idea of characterizing boundary shape using the differential singularities 
of the reaction equation was first introduced into the computer vision literature 
by Kimia, Tannenbaum and Zucker [8]. The idea is to evolve the boundary of 
an object to a canonical skeletal form using the reaction-diffusion equation. The 
skeleton represents the singularities in the curve evolution, where inward moving 
boundaries collide. The reaction component of the boundary motion corresponds 
to morphological erosion of the boundary, while the diffusion component
introduces curvature dependent boundary smoothing. In practice, the skeleton can
be computed in a number of ways; here we use a variant of the method of Siddiqi,
Tannenbaum and Zucker, which solves the eikonal equation that underpins
the reaction-diffusion analysis using the Hamilton-Jacobi formalism of classical
mechanics [14]. Once the skeleton is to hand, the next step is to devise ways
of using it to characterize the shape of the original boundary. Here we follow 
Zucker, Siddiqi, and others, by labeling points on the skeleton using so-called 
shock-labels [15]. According to this taxonomy of local differential structure, there 
are different classes associated with behavior of the radius of the osculating circle 
from the skeleton to the nearest pair of boundary points. The so-called shocks 
distinguish between the cases where the local osculating circle has maximum 
radius, minimum radius, constant radius or a radius which is strictly increasing 
or decreasing. We abstract the skeletons as trees in which the level in the tree is
determined by their time of formation [13,15]. The later the time of formation,
and hence their proximity to the center of the shape, the higher the shock in the 
hierarchy. While this temporal notion of relevance can work well with isolated 
shocks (maxima and minima of the radius function), it fails on monotonically 
increasing or decreasing shock groups. To give an example, a protrusion that 
ends on a vertex will always have the earliest time of creation, regardless of its 
relative relevance to the shape. 

We generate two variants of this matching problem. In the first variant we
use a purely symbolic approach: here the shock trees have uniform weight and
we match only the structure. The second variant is weighted: we assign to each
shock group a weight proportional to the length of the border that generates the
shock; this value proves to be a good measure of skeletal similarity [17].

For our experiments we used a database consisting of 16 shapes. For each
shape in the database, we computed the maximum edited isomorphism with the
other shapes. In the unweighted version the “goodness” measure of the match
is the average fraction of nodes matched, that is,
$W(t_1, t_2) = \frac{1}{2}\left(\frac{\#\hat{t}}{\#t_1} + \frac{\#\hat{t}}{\#t_2}\right)$,
where $\hat{t}$ denotes the matched tree and $\#t$ indicates the number of nodes in the
tree t. Conversely, to calculate the goodness of the weighted match we normalize
the weights so that the sum over all the nodes of a tree is 1. This way we can use
the total weight of the maximum common edited isomorphism as a measure for
the match. In figure 2 we show the shapes and the goodness measure of their
match. The top value of each cell is the result for the unweighted case; the
bottom value represents the weighted match.

To illustrate the usefulness of the set of similarity measures, we have used
them as input to a pairwise clustering algorithm [9]. The aim here is to see whether
the clusters extracted from the weighted or the unweighted tree edit distance
correspond more closely to the different perceptual shape categories in the
database. In the unweighted case the process yielded six clusters (brush (1) + brush
(2) + wrench (4); spanner (3) + horse (13); pliers (5) + pliers (6) + hammer (9);
pliers (7) + hammer (8) + horse (12); fish (10) + fish (11); hand (14) + hand (15)
+ hand (16)). Clearly there is merging and leakage between the different shape
categories. Clustering on the weighted tree edit distances gives better results,
yielding seven clusters (brush (1) + brush (2); spanner (3) + spanner (4); pliers
(5) + pliers (6) + pliers (7); hammer (8) + hammer (9); fish (10) + fish (11);
horse (12) + horse (13); hand (14) + hand (15) + hand (16)). These correspond
exactly to the shape categories in the database.

5 Sensitivity Study 

To augment these real world experiments, we have performed a sensitivity
analysis. The aim here is to characterise the effects of measurement errors
resulting from noise or jitter on the weights and of the structural errors
resulting from node removal.

Node removal tests the capability of the method to cope with structural
modification. To do this we remove an increasing fraction of nodes from a randomly
generated tree. We then match the modified tree against its unedited version.
Since we remove nodes only from one tree, the edited tree will have an exact
match against the unedited version. Hence, we know the optimum value of the
weight that should be attained by the maximum edited isomorphism. This is
equal to the total weight of the edited tree.

Fig. 2. Matching result for unweighted (top) and weighted (bottom) shock trees

By adding measurement errors or jitter to the weights, we test how the 
method copes with a modification in the weight distribution. The measurement 
errors are normally distributed with zero mean and controlled variance. Here we 
match the tree of noisy or jittered weights against its noise-free version. In this 
case we have no easy way to determine the optimal weight of the isomorphism, 
but we do expect a smooth drop in total weight with increasing noise variance. 

Fig. 3. Sensitivity analysis: top-left node removal, top-right node removal without
outliers, bottom-left weight jitter, bottom-right weight jitter without outliers.

We performed the experiments on trees with 10, 15, 20, 25, and 30 nodes. For
each experimental run we used 11 randomly generated trees. The procedure for
generating the random trees was as follows: we commence with an empty tree
(i.e. one with no nodes) and we iteratively add the required number of nodes.
At each iteration nodes are added as children of one of the existing nodes. The
parents are randomly selected with uniform probability from among the existing
nodes. The weights of the newly added nodes are selected at random from an
exponential distribution with mean 1. This procedure will tend to generate trees
in which the branching ratio is highest closest to the root. This is quite realistic of
real-world situations, since shock trees tend to have the same characteristic.
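
The tree generation procedure can be sketched in a few lines of Python. This is an illustrative sketch, assuming nodes are numbered in order of insertion and the first node added becomes the root; the function name and the optional random generator argument are hypothetical.

```python
import random

def random_weighted_tree(n_nodes, rng=None):
    """Grow a random tree by attaching each new node to a uniform random parent.

    Returns (parent, weight): parent maps a node to its parent (None for the
    root), weight maps a node to a draw from an exponential with mean 1.
    """
    rng = rng or random.Random()
    parent, weight = {}, {}
    for node in range(n_nodes):
        # The first node becomes the root; later nodes pick an existing node.
        parent[node] = rng.randrange(node) if node > 0 else None
        weight[node] = rng.expovariate(1.0)
    return parent, weight

parent, weight = random_weighted_tree(20, random.Random(0))
```

Because early nodes are eligible as parents in every later iteration, they accumulate more children on average, which is the source of the high branching ratio near the root.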

The fraction of nodes removed was varied from 0% to 60%. In figure 3 top left 
we show the ratio of the computed weighted edit distance to the optimal value of 
the maximum isomorphism. Interestingly, for certain trees the relaxation algo- 
rithm failed to converge within the allotted number of iterations. Furthermore, 
the algorithm also failed to converge on the noise corrupted variants of these 
trees. In other cases, the algorithm exhibited particularly rapid convergence. 
Again, the variants of these trees also showed rapid algorithm convergence. When 
the method fails to converge in an allocated number of iterations, we can still 
give a lower bound to the weight. However, this bound is substantially lower 
than the average value obtained when the algorithm does converge. The top 
right-hand graph of figure 3 shows the ratio of weight matched when we
eliminate these convergence failures. The main conclusions that can be drawn from
these two plots are as follows. First, the effect of increasing structural error is to
cause a systematic underestimation of the weighted edit distance. Second, the
different curves all exhibit a minimum value of the ratio. The reason for this is
that the matching problem becomes trivial as the trees are decimated to extinction.

The bottom row of figure 3 shows the results obtained when we have added
measurement errors or jitter to the weights. The noise-corrupted weights were
obtained by adding Gaussian noise with standard deviation ranging
from 0 to 0.6. The bottom left-hand graph shows the result of this test. It is clear
that the matched weight decreases almost linearly with the noise standard de- 
viation. In these experiments, we encountered similar problems with algorithm 
non-convergence. Furthermore, the problematic instances were identical. This 
further supports the observation that the problem strongly depends on the in- 
stance. The bottom right-hand plot shows the results of the jitter test with the 
convergence failures removed. 

6 Conclusions 

In this paper we have investigated an optimization approach to tree matching.
We based the work on the tree edit distance framework. We show that any tree
obtained with a sequence of cut operations is a subtree of the transitive closure
of the original tree. Furthermore we show that the necessary condition for any
subtree to be a solution can be reduced to a clique problem in a derived structure.
Using this idea we transform the tree edit distance problem into a series of
maximum weight clique problems and then we use relaxation labeling to find
an approximate solution.

In a set of experiments we apply this algorithm to match shock graphs, a 
graph representation of the morphological skeleton. The results of these experi- 
ments are very encouraging, showing that the algorithm is able to match similar 
shapes together. Furthermore we provide some sensitivity analysis of the method. 

A Motzkin-Straus Heuristic

In 1965, Motzkin and Straus [10] showed that the (unweighted) maximum
clique problem can be reduced to a quadratic programming problem on the
n-dimensional simplex $\Delta = \{x \in \mathbb{R}^n : x_i \ge 0 \text{ for all } i = 1 \ldots n, \ \sum_i x_i = 1\}$; here
$x_i$ are the components of vector x. More precisely, let G = (V, E) be a graph
where V is the node set and E is the edge set, and let $C \subseteq V$ be a maximum
clique of G; then the vector $x^*$ with components $x^*_i = 1/\#C$ if $i \in C$ and 0
otherwise maximizes in $\Delta$ the function $g(x) = x^T A x$, where A is the adjacency
matrix of G. Furthermore, given a set $S \subseteq V$, we define the characteristic vector $x^S$

$$x^S_i = \begin{cases} 1/\#S & \text{if } i \in S \\ 0 & \text{otherwise.} \end{cases}$$

S is a maximum (maximal) clique if and only if $g(x^S)$ is a global (local)
maximum for the function g.

Gibbons et al. [7] generalized this result to the weighted clique case. In their
formulation the association graph is substituted with a matrix $B = (b_{ij})_{i,j \in V}$ that is
related to the weights and connectivity of the graph by the relation

$$b_{ij} = \begin{cases} 1/w_i & \text{if } i = j \\ k_{ij} \ge \frac{b_{ii} + b_{jj}}{2} & \text{if } (i,j) \notin E, \ i \ne j \\ 0 & \text{otherwise.} \end{cases} \qquad (3)$$

Let us consider a weighted graph G = (V, E, w), where V is the set of nodes,
E the set of edges, and $w : V \to \mathbb{R}$ a weight function that assigns a weight to
each node. Gibbons et al. proved that, given a set $S \subseteq V$ and its characteristic
vector $x^S$ defined as

$$x^S_i = \begin{cases} \frac{w(i)}{\sum_{j \in S} w(j)} & \text{if } i \in S \\ 0 & \text{otherwise,} \end{cases}$$

S is a maximum (maximal) weight clique if and only if $x^S$ is a global (local)
minimizer for the quadratic form $x^T B x$. Furthermore, the weight of the clique S is
$w(S) = \frac{1}{{x^S}^T B\, x^S}$.
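
The unweighted Motzkin-Straus characterization can be explored numerically. The sketch below is a Python illustration, not the relaxation labeling scheme of the main text: it assumes the discrete-time replicator dynamics as a local optimizer of $x^T A x$ started from the barycenter of the simplex, and the example graph (a triangle with one pendant node) is hypothetical. On other graphs the unweighted program may converge to points that are not characteristic vectors.

```python
def quadratic_form(matrix, x):
    """Value of x^T M x for a square matrix given as nested lists."""
    n = len(x)
    return sum(x[i] * matrix[i][j] * x[j] for i in range(n) for j in range(n))

def replicator_dynamics(adjacency, iterations=200):
    """Iterate x_i <- x_i (A x)_i / (x^T A x) from the barycenter."""
    n = len(adjacency)
    x = [1.0 / n] * n
    for _ in range(iterations):
        payoff = [sum(adjacency[i][j] * x[j] for j in range(n)) for i in range(n)]
        mean = sum(x_i * f_i for x_i, f_i in zip(x, payoff))
        x = [x_i * f_i / mean for x_i, f_i in zip(x, payoff)]
    return x

# Hypothetical graph: nodes 0, 1, 2 form a triangle, node 3 hangs off node 2.
adjacency = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
x = replicator_dynamics(adjacency)
g = quadratic_form(adjacency, x)
# For the characteristic vector of a clique of size k, g = 1 - 1/k.
estimated_clique_size = 1.0 / (1.0 - g)
```

On this graph the iteration concentrates the mass uniformly on the triangle, so the clique size can be read off from the value of g.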

Unfortunately, under this formulation, the minima are not necessarily iso- 
lated: when we have more than one clique with the same maximal weight, any 
convex linear combinations of their characteristic vectors will give the same max- 
imal value. What this implies is that, if we find a minimizer x* we can derive the 
weight of the clique, but we might not be able to tell the nodes that constitute 
it. 

Bomze, Pelillo and Stix [3] introduce a regularization factor to the quadratic
programming method that generates an equivalent problem with isolated
solutions. The new quadratic program minimizes $x^T C x$ in the simplex, where the
matrix $C = (c_{ij})_{i,j \in V}$ is defined as

$$c_{ij} = \begin{cases} \frac{1}{2 w_i} & \text{if } i = j \\ c_{ii} + c_{jj} & \text{if } (i,j) \notin E, \ i \ne j \\ 0 & \text{otherwise.} \end{cases} \qquad (4)$$

Once again, S is a maximum (maximal) weighted clique if and only if $x^S$ is a
global (local) minimizer for the quadratic program.
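
The regularized matrix and the value it takes on a characteristic vector can be checked directly. The following Python sketch assumes nodes are numbered from 0, weights are given as a list and edges as pairs of node indices; the function names and the example graph are hypothetical. For a clique S the quadratic form evaluates to 1/(2 w(S)), so the clique weight can be recovered from it.

```python
def regularized_matrix(weights, edges):
    """Matrix C: 1/(2 w_i) on the diagonal, c_ii + c_jj on non-edges, else 0."""
    n = len(weights)
    adjacent = {frozenset(e) for e in edges}
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        c[i][i] = 1.0 / (2.0 * weights[i])
    for i in range(n):
        for j in range(n):
            if i != j and frozenset((i, j)) not in adjacent:
                c[i][j] = c[i][i] + c[j][j]
    return c

def characteristic_vector(weights, subset):
    """x_i = w_i / w(S) for i in S and 0 otherwise."""
    total = sum(weights[i] for i in subset)
    return [weights[i] / total if i in subset else 0.0 for i in range(len(weights))]

def quadratic_form(matrix, x):
    """Value of x^T M x for a square matrix given as nested lists."""
    n = len(x)
    return sum(x[i] * matrix[i][j] * x[j] for i in range(n) for j in range(n))

# Hypothetical weighted graph: triangle {0, 1, 2} plus the edge (2, 3).
weights = [1.0, 2.0, 4.0, 2.0]
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
c = regularized_matrix(weights, edges)

clique_value = quadratic_form(c, characteristic_vector(weights, {0, 1, 2}))
recovered_weight = 1.0 / (2.0 * clique_value)
# Nodes 0 and 3 are not adjacent, so this set is penalized by the off-diagonal term.
non_clique_value = quadratic_form(c, characteristic_vector(weights, {0, 3}))
```

The set {0, 3} is not a clique, and its value is larger than the 1/(2 w(S)) it would attain if it were one, which is how the off-diagonal penalty steers the minimizer away from inconsistent sets.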

To solve the quadratic problem we transform it into the equivalent problem 
of maximizing x^( 7 ee^ — C)x, where e = (1, • • • , 1)^ is the vector with every 
component equal to 1 and 7 is a positive scaling constant. 
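
The equivalence of the two problems follows from the simplex constraint alone. As a brief worked step (the symbol $\gamma$ denotes the scaling constant mentioned above), every feasible $x$ satisfies $e^T x = \sum_i x_i = 1$, hence:

```latex
x^T\left(\gamma\, e e^T - C\right) x
  = \gamma \left(e^T x\right)^2 - x^T C x
  = \gamma - x^T C x .
```

The new objective therefore differs from $-x^T C x$ only by the constant $\gamma$, so its maximizers on the simplex are exactly the minimizers of $x^T C x$.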
Abstract. Interest in graph matching techniques in the pattern
recognition field is increasing due to the versatility of representing
knowledge in the form of graphs. However, the size of the graphs as well as the
number of attributes they contain can be too high for optimization
algorithms. This happens for instance in image recognition, where structures
of an image to be recognized need to be matched with a model defined
as a graph.

In order to face this complexity problem, graph matching can be
regarded as a combinatorial optimization problem with constraints, and
it can therefore be solved with evolutionary computation techniques
such as Genetic Algorithms (GAs) and Estimation Distribution
Algorithms (EDAs).

This work proposes the use of EDAs, both in the discrete and continuous
domains, in order to solve the graph matching problem. As an example,
a particular inexact graph matching problem applied to recognition of
brain structures is shown. This paper compares the performance of these
two paradigms for their use in graph matching.



1 Introduction 

Many articles about representation of structural information by graphs in do- 
mains such as image interpretation and pattern recognition can be found in the 
literature [1]. In those, graph matching is used for structural recognition of im- 
ages: the model (which can be an atlas or a map depending on the application) 
is represented in the form of a graph, where each node contains information for 
a particular structure and arcs contain information about relationships between 
structures; a data graph is generated from the images to be analyzed and contains
similar information. Graph matching techniques are then used to determine
which structure in the model corresponds to each of the structures in a given 
image. 

Most existing problems and methods in the graph matching domain assume 
graph isomorphism, where both graphs being matched have the same number of 
nodes and links. In some cases this bijective condition between the two graphs 
is too strong and it is necessary to weaken it and to express the correspondence 
as an inexact graph matching problem. 

When the data graph is generated from an original image without the aid of
an expert, it is difficult to segment the image accurately into meaningful
entities, which is why over-segmentation techniques need to be applied
[1,2,3]. As a result, the number of nodes in the data graph increases and an
isomorphism condition between the model and data graphs cannot be assumed.
Such problems call for inexact graph matching, and similar examples can be
found in other fields.

Several techniques have been applied to inexact graph matching, including 
combinatorial optimization [4,5,6], relaxation [7,8,9,10,11], EM algorithm [12,13], 
and evolutionary computation techniques such as Genetic Algorithms (GAs) [14,15]. 

This work proposes the use of Estimation Distribution Algorithm (EDA) 
techniques in both the discrete and continuous domains, showing the potential 
of this new evolutionary computation approach among traditional ones such as 
GAs. 

The outline of this work is as follows: Section 2 is a review of the EDA 
approach. Section 3 illustrates the inexact graph matching problem and shows 
how to face it with EDAs. Section 4 describes the experiment carried out and 
the results obtained. Finally, Section 5 gives the conclusions and suggests further
work.



2 Estimation Distribution Algorithms 

2.1 Introduction 

EDAs [16,17,18] are non-deterministic, stochastic heuristic search strategies that
form part of the evolutionary computation approaches, where a number of
solutions or individuals are created every generation, evolving again and again
until a satisfactory solution is achieved. In brief, the characteristic that most
differentiates EDAs from other evolutionary search strategies such as GAs is that
the evolution from one generation to the next is done by estimating the probability
distribution of the fittest individuals, and afterwards by sampling the induced
model. This avoids the use of crossover or mutation operators, and the number
of parameters that EDAs require is reduced considerably.

In EDAs, the individuals are not said to contain genes, but variables whose
dependencies have to be analyzed. Also, while in other heuristics from
evolutionary computation the interrelations between the different variables
representing the individuals are kept in mind implicitly (e.g. building block
hypothesis), in EDAs the interrelations are expressed explicitly through the joint
probability distribution associated with the individuals selected at each
iteration. The task of estimating the joint probability distribution associated
with the database of the selected individuals from the previous generation
constitutes the hardest work to perform, as this requires the adaptation of
methods to learn models from data developed in the domain of probabilistic
graphical models.

EDA

  D_0 ← Generate N individuals (the initial population) randomly

  Repeat for l = 1, 2, ... until a stopping criterion is met

    Select Se < N individuals from D_{l-1} according to
    a selection method

    p_l(x) ← Estimate the probability distribution
    of an individual being among the selected individuals

    D_l ← Sample N individuals (the new population) from p_l(x)

Fig. 1. Pseudocode for EDA approach.

Figure 1 shows the pseudocode of EDA, in which we distinguish four main 
steps in this approach: 

1. At the beginning, the first population $D_0$ of $N$ individuals is generated,
usually by assuming a uniform distribution (either discrete or continuous)
on each variable, and evaluating each of the individuals.

2. Secondly, a number $Se$ ($Se < N$) of individuals are selected, usually the
fittest ones.

3. Thirdly, the n-dimensional probabilistic model that better expresses the 
interdependencies between the n variables is induced. 

4. Next, the new population of N new individuals is obtained by simulating 
the probability distribution learned in the previous step. 

Steps 2, 3 and 4 are repeated until a stopping condition is verified. The most 
important step of this new paradigm is to find the interdependencies between 
the variables (step 3). This task will be done using techniques from the field of 
probabilistic graphical models. 
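
The four steps above can be sketched as a short loop. The example below is only an illustration under stated assumptions: it uses the simplest possible model in step 3 (independent marginal probabilities for binary variables) and a toy fitness function that counts ones; neither the function names nor the parameter values come from the experiments reported later.

```python
import random


def onemax(bits):
    """Toy fitness (hypothetical stand-in problem): number of ones."""
    return sum(bits)


def run_eda(n_vars=20, pop_size=100, n_selected=50, generations=30, seed=0):
    rng = random.Random(seed)
    # Step 1: initial population drawn from a uniform distribution.
    population = [[rng.randint(0, 1) for _ in range(n_vars)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Step 2: select the Se fittest individuals.
        selected = sorted(population, key=onemax, reverse=True)[:n_selected]
        # Step 3: estimate the model; here, independent marginals P(X_i = 1).
        probs = [sum(ind[i] for ind in selected) / n_selected
                 for i in range(n_vars)]
        # Step 4: sample N new individuals from the estimated model.
        population = [[1 if rng.random() < p else 0 for p in probs]
                      for _ in range(pop_size)]
    return max(population, key=onemax)


best = run_eda()
```

Replacing the marginal estimate in step 3 by a richer probabilistic graphical model, and the sampling in step 4 by a simulation of that model, gives the more elaborate EDAs discussed in the following sections.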

Next, some notation is introduced. Let $X = (X_1, \ldots, X_n)$ be a set of random
variables, and let $x_i$ be a value of $X_i$, the $i$-th component of $X$. Let
$y = (x_i)_{X_i \in Y}$ be a value of $Y \subseteq X$. Then, a probabilistic
graphical model for $X$ is a graphical factorization of the joint generalized
probability density function, $p(X = x)$ (or simply $p(x)$). The representation of
this model is given by two components: a structure and a set of local generalized
probability densities.

With regard to the structure of the model, the structure $S$ for $X$ is a directed
acyclic graph (DAG) that describes a set of conditional independences between
the variables on $X$. $Pa_i^S$ represents the set of parents (variables from which
an arrow is coming out in $S$) of the variable $X_i$ in the probabilistic graphical
model, the structure of which is given by $S$. The structure $S$ for $X$ assumes that
$X_i$ and its non-descendants are independent given $Pa_i^S$, $i = 2, \ldots, n$.
Therefore, the factorization can be written as follows:

$$p(x) = p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p(x_i \mid pa_i^S). \qquad (1)$$

Furthermore, regarding the local generalized probability densities associated
with the probabilistic graphical model, these are precisely the ones appearing in
Equation 1.

A representation of the models of the characteristics described above assumes
that the local generalized probability densities depend on a finite set of
parameters $\theta_S \in \Theta_S$, and as a result the previous equation can be
rewritten as follows:

$$p(x \mid \theta_S) = \prod_{i=1}^{n} p(x_i \mid pa_i^S, \theta_i) \qquad (2)$$

where $\theta_S = (\theta_1, \ldots, \theta_n)$.

After having defined both components of the probabilistic graphical model,
the model itself will be represented by $M = (S, \theta_S)$.


2.2 EDAs in Discrete Domains 

In the particular case where every variable $X_i \in X$ is discrete, the
probabilistic graphical model is called a Bayesian network [19]. If the variable
$X_i$ has $r_i$ possible values, $x_i^1, \ldots, x_i^{r_i}$, the local distribution
$p(x_i \mid pa_i^{j,S}, \theta_i)$ is:

$$p(x_i^k \mid pa_i^{j,S}, \theta_i) = \theta_{x_i^k \mid pa_i^j} \equiv \theta_{ijk} \qquad (3)$$

where $pa_i^{1,S}, \ldots, pa_i^{q_i,S}$ denotes the values of $Pa_i^S$, that is the
set of parents of the variable $X_i$ in the structure $S$; $q_i$ is the number of
different possible instantiations of the parent variables of $X_i$. Thus,
$q_i = \prod_{X_g \in Pa_i^S} r_g$. The local parameters are given by
$\theta_i = ((\theta_{ijk})_{k=1}^{r_i})_{j=1}^{q_i}$. In other words, the parameter
$\theta_{ijk}$ represents the conditional probability that variable $X_i$ takes its
$k$-th value, knowing that the set of its parent variables takes its $j$-th value.
We assume that every $\theta_{ijk}$ is greater than zero.

All the EDAs are classified depending on the maximum number of depen- 
dencies between variables that they accept (maximum number of parents that a 
variable Xi can have in the probabilistic graphical model). 



Without Interdependencies. The Univariate Marginal Distribution Algorithm
(UMDA) [20] is a representative example of this category, which can be
written as:

$$p_l(x; \theta^l) = \prod_{i=1}^{n} p_l(x_i; \theta_i^l) \qquad (4)$$

where $\theta^l = \{\theta_{ijk}^l\}$ is recalculated every generation by its
maximum likelihood estimation, i.e.
$\hat{\theta}_{ijk}^l = N_{ijk}^{l-1} / N_{ij}^{l-1}$. $N_{ijk}^l$ is the number of
cases in which the variable $X_i$ takes the value $x_i^k$ when its parents are on
their $j$-th combination of values for the $l$-th generation, and
$N_{ij}^{l-1} = \sum_k N_{ijk}^{l-1}$.

Pairwise Dependencies. An example of this second category is the greedy al- 
gorithm called MIMIC (Mutual Information Maximization for Input Clustering) 
[21]. The main idea in MIMIC is to describe the true mass joint probability as 
closely as possible by using only one univariate marginal probability and n — 1 
pairwise conditional probability functions. 



Multiple Interdependencies. We will use EBNA (Estimation of Bayesian
Network Algorithm) [22] as an example of this category. The EBNA approach
was introduced for the first time in [23], where the authors use the Bayesian
Information Criterion (BIC) [24] as the score to evaluate the goodness of each
structure found during the search. Following this criterion, the corresponding
BIC score, $BIC(S, D)$, for a Bayesian network structure $S$ constructed from a
database $D$ and containing $N$ cases can be proved to be as follows:

$$BIC(S, D) = \sum_{i=1}^{n} \sum_{j=1}^{q_i} \sum_{k=1}^{r_i} N_{ijk} \log \frac{N_{ijk}}{N_{ij}} - \frac{\log N}{2} \sum_{i=1}^{n} (r_i - 1)\, q_i \qquad (5)$$

where $N_{ijk}$ denotes the number of cases in $D$ in which the variable $X_i$ has
the value $x_i^k$ and $Pa_i$ is instantiated as its $j$-th value, and
$N_{ij} = \sum_{k=1}^{r_i} N_{ijk}$.
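
To make the score concrete, the following sketch computes a BIC value of this form from a small database of cases. It is a minimal illustration, assuming integer-coded values and a structure given as a list of parent tuples; the function name `bic_score` and the toy data are hypothetical. Terms with $N_{ijk} = 0$ are skipped, following the usual convention $0 \log 0 = 0$.

```python
import math
from collections import Counter


def bic_score(data, parents, cardinalities):
    """BIC of a structure: log-likelihood minus (log N / 2) * dimension.

    data: list of cases, each a tuple of integer values.
    parents: parents[i] is the tuple of parent indices of variable i.
    cardinalities: cardinalities[i] is r_i, the number of values of X_i.
    """
    n_cases = len(data)
    log_likelihood = 0.0
    dimension = 0
    for i, pa in enumerate(parents):
        # N_ijk: counts per (parent configuration j, value k).
        n_ijk = Counter((tuple(case[p] for p in pa), case[i]) for case in data)
        # N_ij: counts per parent configuration j.
        n_ij = Counter(tuple(case[p] for p in pa) for case in data)
        for (j, _k), count in n_ijk.items():
            log_likelihood += count * math.log(count / n_ij[j])
        q_i = math.prod(cardinalities[p] for p in pa)  # equals 1 without parents
        dimension += (cardinalities[i] - 1) * q_i
    return log_likelihood - 0.5 * math.log(n_cases) * dimension


# Toy database in which the second variable always copies the first one.
cases = [(0, 0), (0, 0), (1, 1), (1, 1)]
bic_arcless = bic_score(cases, [(), ()], [2, 2])
bic_with_arc = bic_score(cases, [(), (0,)], [2, 2])
```

In this toy database the structure with the arc obtains the higher score, because the gain in likelihood outweighs the penalty for the extra parameters.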

Unfortunately, to obtain the best model all possible structures must be
searched through, which has been proved to be NP-hard [25]. Even if promising
results have been obtained through global search techniques [26,27,28], their
computation cost makes them impractical for our problem. As the aim is to find
a model as good as possible (even if not the optimal one) in a reasonable period
of time, a simpler algorithm is preferred. An example of the latter is the
so-called Algorithm B [29], which is a greedy search heuristic that begins with an
arc-less structure and iteratively adds the arcs that produce maximum improvement
according to the BIC approximation, although other measures can also be applied.
The algorithm stops when adding another arc would not increase the score of the
structure.

Local search strategies are another way of obtaining good models. These
begin with a given structure, and at every step the addition or deletion of the
arc that most improves the scoring measure is performed. Local search strategies
stop when no modification of the structure improves the scoring measure. The
main drawback of local search strategies is their strong dependence on the initial
structure. Nevertheless, since it has been shown in [30] that local search strategies
perform quite well when the initial structure is reasonably good, the model of
the previous generation could be used as the initial structure.

The initial model $M_0$ in EBNA is formed by its structure $S_0$, which is an
arc-less DAG, and the local probability distributions given by the $n$
unidimensional marginal probabilities $p(X_i = x_i) = 1/r_i$, $i = 1, \ldots, n$;
that is, $M_0$ assigns the same probability to all individuals. The model of the
first generation, $M_1$, is learned using Algorithm B, while the rest of the models
are learned following a local search strategy that receives the model of the
previous generation as the initial structure.



Simulation in Bayesian Networks. In EDAs, the simulation of Bayesian networks
is used merely as a tool to generate new individuals for the next population
based on the structure learned previously. The method used in this work is the
Probabilistic Logic Sampling (PLS) proposed in [31]. Following this method, the
instantiations are done one variable at a time in a forward way, that is, a variable
is not sampled until all its parents have been sampled.
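
The forward sampling idea can be sketched in a few lines. The code below is an illustrative sketch, not the implementation used in the experiments: the data layout (a list of parent tuples and one conditional probability table per variable) and the function names are assumptions made for the example.

```python
import random


def topological_order(parents):
    """Order the variables so that every parent precedes its children."""
    order, placed = [], set()
    while len(order) < len(parents):
        progressed = False
        for i, pa in enumerate(parents):
            if i not in placed and all(p in placed for p in pa):
                order.append(i)
                placed.add(i)
                progressed = True
        if not progressed:
            raise ValueError("structure contains a cycle")
    return order


def sample_individual(parents, cpts, rng):
    """Forward-sample one individual.

    cpts[i] maps a tuple of parent values to a list of probabilities
    over the values of X_i (hypothetical layout).
    """
    values = [None] * len(parents)
    for i in topological_order(parents):
        parent_values = tuple(values[p] for p in parents[i])
        probs = cpts[i][parent_values]
        values[i] = rng.choices(range(len(probs)), weights=probs)[0]
    return values


# Toy network X_0 -> X_1 in which X_1 deterministically copies X_0.
toy_parents = [(), (0,)]
toy_cpts = [
    {(): [0.5, 0.5]},
    {(0,): [1.0, 0.0], (1,): [0.0, 1.0]},
]
rng = random.Random(1)
samples = [sample_individual(toy_parents, toy_cpts, rng) for _ in range(50)]
```

Because a variable is only instantiated once its parents have values, each conditional table can be indexed directly by the already sampled parent configuration.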

2.3 EDAs in Continuous Domains 

In this section we introduce an example of the probabilistic graphical model
paradigm that assumes the joint density function to be a multivariate Gaussian
density.

The local density function for the $i$-th variable $X_i$ is computed as the
linear-regression model

$$f(x_i \mid pa_i^S, \theta_i) \equiv \mathcal{N}\Bigl(x_i;\ m_i + \sum_{x_j \in pa_i} b_{ji}(x_j - m_j),\ v_i\Bigr) \qquad (6)$$

where $\mathcal{N}(x; \mu, \sigma^2)$ is a univariate normal distribution with mean
$\mu$ and variance $\sigma^2$.

Local parameters are given by $\theta_i = (m_i, b_i, v_i)$, where
$b_i = (b_{1i}, \ldots, b_{i-1,i})^t$ is a column vector. Local parameters are as
follows: $m_i$ is the unconditional mean of $X_i$, $v_i$ is the conditional variance
of $X_i$ given $Pa_i$, and $b_{ji}$ is a linear coefficient that measures the
strength of the relationship between $X_j$ and $X_i$. A probabilistic graphical
model built from these local density functions is known as a Gaussian network
[32]. Gaussian networks are of interest in continuous EDAs because the number of
parameters needed to specify a multivariate Gaussian density is smaller.

Next, continuous EDAs are classified in the same way as in the discrete domain,
that is, depending on the number of dependencies they take into account.



Without Dependencies. In this case, the joint density function is assumed to
follow an n-dimensional normal distribution, and thus it is factorized as a product
of n unidimensional and independent normal densities. Using the mathematical
notation $X \equiv \mathcal{N}(x; \mu, \Sigma)$, this assumption can be expressed as:

$$f_{\mathcal{N}}(x; \mu, \Sigma) = \prod_{i=1}^{n} f_{\mathcal{N}}(x_i; \mu_i, \sigma_i^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_i}\, e^{-\frac{1}{2}\left(\frac{x_i - \mu_i}{\sigma_i}\right)^2} \qquad (7)$$

An example of continuous EDAs in this category is UMDA_c [33].



Bivariate Dependencies. An example of this category is MIMIC_c [33], which
is basically an adaptation of the MIMIC algorithm [21] to the continuous domain.



Multiple Dependencies. Algorithms in this section are approaches of EDAs
for continuous domains in which there is no restriction in the learning of the
density function every generation. An example of this category is EGNA_BGe
(Estimation of Gaussian Network Algorithm) [33]. The method used to find the
Gaussian network structure is a Bayesian score+search. In EGNA_BGe a local
search is used to search for good structures.

Simulation of Gaussian Networks. A general approach for sampling from
multivariate normal distributions is known as the conditioning method, which
generates instances of $X$ by sampling $X_1$, then $X_2$ conditionally on $X_1$,
and so on. The simulation of a univariate normal distribution can be done with a
simple method based on the sum of 12 uniform variables.
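
Both ideas can be illustrated together. The sketch below assumes that the variables are indexed in an ancestral order and that each local density has the linear-regression form of Section 2.3; the function names and the data layout are hypothetical. The normal generator relies on the fact that the sum of 12 independent uniform variables on [0, 1) has mean 6 and variance 1.

```python
import random


def standard_normal(rng):
    """Approximate N(0, 1) draw: sum of 12 U(0, 1) variables, minus 6."""
    return sum(rng.random() for _ in range(12)) - 6.0


def sample_gaussian_network(means, variances, coeffs, rng):
    """Sample X_1, X_2, ... in index order (assumed to be an ancestral order).

    coeffs[i] maps a parent index j < i to the linear coefficient b_ji.
    """
    x = []
    for i, (m_i, v_i) in enumerate(zip(means, variances)):
        # Conditional mean: m_i + sum_j b_ji * (x_j - m_j).
        mean = m_i + sum(b * (x[j] - means[j]) for j, b in coeffs[i].items())
        x.append(mean + (v_i ** 0.5) * standard_normal(rng))
    return x


rng = random.Random(42)
draws = [standard_normal(rng) for _ in range(20000)]
sample_mean = sum(draws) / len(draws)
sample_var = sum((d - sample_mean) ** 2 for d in draws) / len(draws)
```

The approximation is bounded in [-6, 6], which is usually acceptable for generating new individuals but would not be suitable where the tails of the distribution matter.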

3 Graph Matching as a Combinatorial Optimization 
Problem with Constraints 

3.1 Traditional Representation of Individuals 

The choice of an adequate individual representation is a very important step
in any problem to be solved with heuristics, as it will determine the behavior
of the search. An individual represents a point in the search space that has to
be evaluated, and therefore is a solution. For a graph matching problem, each
solution represents a match between the nodes of a data graph $G_2$ and those of
a model graph $G_1$.

A possible representation that has already been used either in GAs or discrete
EDAs [34] consists of individuals with $|V_2|$ variables, where each variable can
take any value between 1 and $|V_1|$. More formally, the individual as well as the
solution it represents could be defined as follows: for $1 \le k \le |V_1|$ and
$1 \le i \le |V_2|$, $X_i = k$ means that the $i$-th node of $G_2$ is matched with
the $k$-th node of $G_1$.



3.2 Representing a Matching as a Permutation 

Permutation-based representations have been typically applied to problems such
as the Travelling Salesman Problem (TSP), but they can also be used for inexact
graph matching. In this case the meaning of the individual is completely different,
as an individual does not show directly which node of $G_2$ is matched with each
node of $G_1$. In fact, what we obtain from each individual is the order in which
nodes will be analyzed and treated so as to compute the matching solution that
it is representing.

For the individuals to contain a permutation, the individuals will have the
same size as the traditional ones described in Section 3.1 (i.e. $|V_2|$ variables
long). However, the number of values that each variable can take will be $|V_2|$,
and not $|V_1|$ as in that representation. In fact, it is important to note that a
permutation is a list of numbers in which all the values from 1 to $n$ have to
appear in an individual of size $n$. In other words, our new representation of
individuals needs to satisfy a strong constraint in order to be considered as
correct, that is, they all have to contain every value from 1 to $n$, where
$n = |V_2|$.

More formally, for $1 \le k \le |V_2|$ and $1 \le i \le |V_2|$, $X_i = k$ means that
the $k$-th node of $G_2$ will be the $i$-th node that is analyzed for its most
appropriate match.

Now it is important to define a procedure to obtain the solution that each 
permutation symbolizes. As this procedure will be done for each individual, it is 
important that this translation is performed by a fast and simple algorithm. A 
way of doing this is introduced next. 

A solution for the inexact graph matching problem can be calculated by
comparing the nodes to each other and deciding which is more similar to which
using a similarity function $\varpi(i, j)$ defined to compute the similarity
between nodes $i$ and $j$. The similarity measures used so far in the literature
have been applied to two nodes, one from each graph, and their aim was to help in
the computation of the fitness of a solution, that is, the final value of a fitness
function. However, the similarity measure $\varpi(i, j)$ proposed in this work is
quite different, as the two nodes to be evaluated are both in the data graph
($i, j \in V_2$); see Section 4.3 for more details. With these new similarity
values we will identify for each particular node of $G_2$ which other nodes in the
data graph are most similar to it, and try to group it with the best set of already
matched nodes.

Given an individual $x = (x_1, \ldots, x_{|V_1|}, x_{|V_1|+1}, \ldots, x_{|V_2|})$,
the procedure to do the translation is performed in two phases as follows:

1. The first $|V_1|$ values $(x_1, \ldots, x_{|V_1|})$ that directly represent nodes
of $V_2$ will be matched to nodes $1, 2, \ldots, |V_1|$ (that is, the node
$x_1 \in V_2$ is matched with the node $1 \in V_1$, the node $x_2 \in V_2$ is
matched with the node $2 \in V_1$, and so on, until the node $x_{|V_1|} \in V_2$
is matched with the node $|V_1| \in V_1$).

2. For each of the following values of the individual,
$(x_{|V_1|+1}, \ldots, x_{|V_2|})$, and following their order of appearance in the
individual, the most similar node will be chosen from all the previous values in
the individual by means of the similarity measure $\varpi$. Each of these nodes of
$G_2$ is assigned the node of $G_1$ that is matched to its most similar node of
$G_2$.

The first phase is very important in the generation of the individual, as this
is also the one that ensures the correctness of the solution represented by the
permutation: all the values of $V_1$ are assigned from the beginning, and as we
assumed $|V_2| > |V_1|$, we conclude that all the nodes of $G_1$ will have at least
one occurrence in the solution represented by any permutation.
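
The two-phase translation can be sketched as follows. This is a minimal illustration under stated assumptions: nodes are numbered from 0 rather than from 1, the similarity is passed in as a callable, and the grey-level based similarity in the example is a hypothetical stand-in for the measure of Section 4.3.

```python
def permutation_to_matching(permutation, n_model, similarity):
    """Translate a permutation of data-graph nodes into a matching.

    permutation: data nodes (0-based) in the order in which they are treated.
    n_model: |V_1|, the number of model nodes.
    similarity: callable similarity(i, j) between two data nodes.
    Returns a dict mapping each data node to a model node (0-based).
    """
    matching = {}
    # Phase 1: the first |V_1| data nodes are matched to model nodes 0..|V_1|-1.
    for model_node, data_node in enumerate(permutation[:n_model]):
        matching[data_node] = model_node
    # Phase 2: each later node takes the match of its most similar predecessor.
    for position in range(n_model, len(permutation)):
        node = permutation[position]
        closest = max(permutation[:position],
                      key=lambda other: similarity(node, other))
        matching[node] = matching[closest]
    return matching


# Hypothetical mean grey levels of five segmented regions.
grey = [10, 200, 12, 190, 205]


def grey_similarity(i, j):
    return 1.0 - abs(grey[i] - grey[j]) / 255.0


result = permutation_to_matching([0, 1, 2, 3, 4], 2, grey_similarity)
```

Since phase 1 assigns every model node once, any permutation yields a matching in which all model nodes occur, which is the correctness property discussed above.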

Looking for correct individuals 

As explained in Section 2.2, the simulation process is PLS [31]. But a simple 
PLS algorithm will not take into account any restriction the individuals must 
have for a particular problem. The interested reader can find a more exhaustive 
review of this topic in [34], where the authors propose different methods to obtain 
only correct individuals that satisfy the particular constraints of the problem. 



3.3 Obtaining a Permutation with Continuous EDAs 

Continuous EDAs provide the search with other types of EDA algorithms that 
can be more suitable for some problems. But again, the main goal is to find a 
representation of individuals and a procedure to obtain an univocal solution to 
the matching from each of the possible permutations. 

In this case we propose a strategy based on the previous section, trying to 
translate the individual in the continuous domain to a correct permutation in 
the discrete one, evaluating it as explained in Section 3.2. This procedure has to 
be performed for each individual in order to be evaluated. Again, this process 
has to be fast enough in order to reduce computation time. 

With all these aspects in mind, individuals of the same size ($n = |V_2|$) will be
defined, where each of the variables of the individual can take any value following
a Gaussian distribution. This new representation of individuals is a continuous
value in $\mathbb{R}^n$ that does not provide directly the solution it symbolizes:
the values for each of the variables only show the way to translate from the
continuous world to a permutation, and it does not contain similarity values
between nodes of both graphs. This new type of representation can also be regarded
as a way to focus the search from the continuous world, where the techniques that
can be applied to the estimation of densities are completely different.

In order to obtain a translation to a discrete permutation individual, we
propose to order the continuous values of the individual, and to set its
corresponding discrete values by assigning to each $x_i \in \{1, \ldots, |V_2|\}$
the respective order in the continuous individual. The procedure described in this
section is further described in [35].
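
One way to read this translation is as a ranking of the continuous values. The sketch below follows that reading, which is an assumption (the exact procedure is the one given in [35]): each variable is replaced by the 1-based position of its value in the sorted individual, with ties broken by index.

```python
def continuous_to_permutation(values):
    """Replace each continuous value by its rank (1-based) in the individual."""
    # Indices of the variables, sorted by their continuous value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks


permutation = continuous_to_permutation([0.7, -1.2, 3.4, 0.1])
```

The result always contains every value from 1 to n exactly once, so the constraint required of permutation individuals holds by construction and no repair step is needed after sampling.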

4 Experimental Results. The Human Brain Example 

4.1 Overview of the Human Brain Example 

The example chosen to test the performance of the different EDAs for
permutation-based representations in inexact graph matching is a problem of
recognition of regions in 3D Magnetic Resonance Images (MRI) of the brain. The data
graph $G_2 = (V_2, E_2)$ is generated after over-segmenting an image and contains a
node for each segmented region (subset of a brain structure). The model graph
$G_1 = (V_1, E_1)$ contains a node for each of the brain regions to be recognized.
The experiments carried out in this chapter are focused on this type of graphs,
but could similarly be adapted to any other inexact graph matching problem.

More specifically, the model graph was obtained from the main structures
of the inner part of the brain (the brainstem). This example is a reduced
version of the brain images recognition problem in [1]. In our case $G_2$ has
94 nodes (the number of structures of the image to be recognized) and
2868 arcs. The model graph contains 13 nodes and 84 arcs.



4.2 Description of the Experiment 

This section compares EDA algorithms with each other and with a broadly known GA,
GENITOR [36], which is a steady state type algorithm (ssGA).

Both EDAs and GENITOR were implemented in ANSI C++ language, and 
the experiment was executed on a two processor Ultra 80 Sun computer under 
Solaris version 7 with 1 GByte of RAM. 

The initial population for all the algorithms was created using the same ran- 
dom generation procedure based on a uniform distribution. The fitness function 
used is described later in Section 4.4. 

In the discrete case, all the algorithms were set to stop the search after a
maximum of 100 generations or once uniformity in the population was reached.
GENITOR, as it is a ssGA algorithm, only generates two individuals at each
iteration, but it was also programmed to generate the same number of
individuals as in discrete EDAs by allowing more iterations (201900 individuals).
In the continuous case, the ending criterion was to reach 301850 evaluations (i.e.
number of individuals generated).

In EDAs, the following parameters were used: a population of 2000 individuals
($N = 2000$), from which a subset of the best 1000 are selected ($Se = 1000$)
to estimate the probability, and the elitist approach was chosen (that is, the
best individual is always included in the next population and 1999 individuals are
simulated). In GENITOR a population of 2000 individuals was also set, with a 
mutation rate of and a crossover probability of pc = 1. The operators 

used in GENITOR were CX [37] and EM [38].



4.3 Definition of the Similarity Function 

Regarding the similarity concept, we have used only a similarity measure
based on the grey level distribution, so that the function $\varpi$ returns a
higher value for two nodes when the grey level distribution over two segments of
the data image is more similar. In addition, no clustering process is performed,
and therefore the similarity measure $\varpi$ is kept constant during the
generation of individuals. These decisions have been made knowing the nature and
properties of an MRI image. More formally, the function $\varpi$ can be defined as
the set of functions that measure the correspondence between two nodes of the data
graph $G_2$: $\varpi = \{\varpi_{u_2} : V_2 \to [0, 1],\ u_2 \in V_2\}$.
4.4 Definition of the Fitness Function



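We have chosen a function proposed in [1] as an example. Following this function,
an individual $x = (x_1, \ldots, x_{|V_2|})$ will be evaluated as follows:

$$f(x; \rho_\sigma, \rho_\mu, \alpha) = \frac{\alpha}{|V_2||V_1|} \sum_{i=1}^{|V_2|} \sum_{j=1}^{|V_1|} \left(1 - \left|c_{ij} - \rho_\sigma^{u_1^j}(u_2^i)\right|\right) + \frac{1 - \alpha}{|E_2||E_1|} \sum_{e_1^l = (u_1^j, v_1^{j'}) \in E_1}\ \sum_{e_2^k = (u_2^i, v_2^{i'}) \in E_2} \left(1 - \left|c_{ij}\, c_{i'j'} - \rho_\mu^{e_1^l}(e_2^k)\right|\right) \qquad (8)$$

where

$$c_{ij} = \begin{cases} 1 & \text{if } X_i = j, \\ 0 & \text{otherwise,} \end{cases}$$

$\alpha$ is a parameter used to adapt the weight of node and arc correspondences in
$f$. For each $u_1^j \in V_1$, $\rho_\sigma^{u_1^j}$ is a function from $V_2$ into
$[0, 1]$ that measures the correspondence between $u_1^j$ and each node of $V_2$.
Similarly, for each $e_1^l \in E_1$, $\rho_\mu^{e_1^l}$ is the set of functions from
$E_2$ into $[0, 1]$ that measure the correspondence between the arcs of both graphs
$G_1$ and $G_2$. The value of $f$ associated with each individual returns the
goodness of the matching. Typically $\rho_\sigma$ and $\rho_\mu$ are related
to the similarities between node and arc properties respectively.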

Node properties are described as attributes on grey level and size, while edge 
properties correspond to spatial relationships between nodes. 



4.5 Experimental Results 

Results such as the best individual obtained, the computation time, and the 
number of evaluations to reach the final solution were recorded for each of the 
experiments. The computation time obtained is the CPU time of the process for 
each execution, and therefore it is not dependent on the load of the system. The 
latter is given as a measure to illustrate the different computation complexity of 
all the algorithms. 

Each algorithm was executed 10 times. The non-parametric tests of Kruskal-Wallis
and Mann-Whitney were used to test the null hypothesis of the same
distribution densities for all (or some) of them. This task was done with the
statistical package S.P.S.S. release 9.00. The results for the tests applied to all
the algorithms are shown in Table 1. The study of particular algorithms gives
the following results:

— Between algorithms of similar complexity only: 

• UMDA vs. UMDA_c. Fitness value: p < 0.001; CPU time: p < 0.001;
Evaluations: p < 0.001.
Table 1. Mean values of experimental results after 10 executions for each algorithm
of the inexact graph matching problem of the Human Brain example.

            Best fitness value   Execution time (hh:mm:ss)   Number of evaluations
UMDA        0.718623             00:53:29                    85958
UMDA_c      0.745036             03:01:05                    301850
MIMIC       0.702707             00:57:30                    83179
MIMIC_c     0.747970             03:01:07                    301850
EBNA        0.716723             01:50:39                    85958
EGNA        0.746893             04:13:39                    301850
ssGA        0.693575             07:31:26                    201900
            p < 0.001            p < 0.001                   p < 0.001



• MIMIC vs. MIMIC_c. Fitness value: p < 0.001; CPU time: p < 0.001;
Evaluations: p < 0.001.

• EBNA vs. EGNA. Fitness value: p < 0.001; CPU time: p < 0.001;
Evaluations: p < 0.001.

These results show that the differences between EDAs in the discrete and
continuous domains are significant in all the cases analyzed, meaning that
a discrete learning algorithm and its equivalent in the continuous domain
behave very differently. It is important to note that the
number of evaluations was expected to be different, as the ending criteria
for the discrete and continuous domains were also different. In all the cases,
continuous EDAs obtained a fitter individual, but the CPU time and number
of individuals created were also larger.

— Between discrete algorithms only: 

• Fitness value: p < 0.001. CPU time: p < 0.001. Evaluations: p < 0.001.
In this case significant results are also obtained in fitness value, CPU time,
and number of evaluations. The discrete algorithm that obtained the best
result was UMDA, closely followed by EBNA. The differences in the CPU
time are also in line with the complexity of the learning algorithm they
apply. Finally, the results show that MIMIC required significantly fewer
individuals to converge (to reach uniformity in the population), whereas the
other two EDA algorithms require nearly the same number of evaluations to
converge. The genetic algorithm GENITOR is far behind the performance
of EDAs. The computation time is also a factor to consider: the fact that
GENITOR requires about 7 hours for each execution shows the complexity
of the graph matching problem.

— Between continuous algorithms only: 

• Fitness value: p = 0.342. CPU time: p < 0.001. Evaluations: p = 1.000.
Differences in fitness between the continuous EDAs are not significant.
As expected, the CPU time required for each of them is in line with the
complexity of the learning algorithm. On the other hand, the identical
number of evaluations is due to the common ending criterion. Regarding
the differences in computation time between discrete and continuous
EDA algorithms, it is important to note that the latter ones require all the
300000 individuals to be generated before they finish the search. The
computation time for the continuous algorithms is also longer than for the
discrete equivalents as a result of several factors: firstly, the higher number
of evaluations they perform in each execution; secondly, the longer
individual-to-solution translation procedure that has to be done for each of
the individuals generated; and lastly, the longer time required
to learn the model in continuous spaces.

We can conclude from the results that generally speaking continuous algo- 
rithms perform better than discrete ones, either when comparing all of them in 
general or only with algorithms of equivalent complexity. 

5 Conclusions and Further Work 

This work describes the application of the EDA approach to graph matching. 
Different individual representations have been shown in order to allow the use 
of discrete and continuous representation and algorithms. 

In an experiment with real data, the performance of this new approach in the 
discrete and continuous domains has been compared. Continuous EDAs 
performed better in terms of the fittest individual obtained, but they 
required a longer execution time and more evaluations. 
Additionally, other fitness functions should be tested with this new approach. 
Techniques such as [39,40] could also help to introduce better similarity measures 
and therefore improve the results obtained considerably. 

For the near future there are several tasks to be done. The most important 
is to perform more experiments with more data images (more data graphs) 
in order to evaluate the effectiveness of the proposed matching heuristic with 
more examples. In addition, a deeper study on the influence of node and arc 
correspondences requires also to be done. These new experiments are expected 
to highlight the importance of the structural aspects (the edges) as appreciated 
in our recent work. 
Abstract. Graph matching is a problem that pervades computer vision 
and pattern recognition research. During the past few decades, two rad- 
ically distinct approaches have been pursued to tackle it. The first views 
the matching problem as one of explicit search in state-space. A classi- 
cal method within this class consists of transforming it in the equivalent 
problem of finding a maximal clique in a derived “association graph.” 
In the second approach, the matching problem is viewed as one of en- 
ergy minimization. Recently, we have provided a unifying framework for 
graph matching which is centered around a remarkable result proved by 
Motzkin and Straus in the mid-sixties. This allows us to formulate the 
maximum clique problem in terms of a continuous quadratic optimization 
problem. In this paper we propose a new framework for graph match- 
ing based on the linear complementarity problem (LCP) arising from 
the Motzkin-Straus program. We develop a pivoting-based technique to 
find a solution for our LCP which is a variant of Lemke’s well-known 
method. Preliminary experiments are presented which demonstrate the 
effectiveness of the proposed approach. 



1 Introduction 

Graph matching is a fundamental problem in computer vision and pattern recog- 
nition, and a great deal of effort has been devoted over the past decades to devise 
efficient and robust algorithms for it (see [8] for an update on recent develop- 
ments). Basically, two radically distinct approaches have emerged, a distinction 
which reflects the well-known dichotomy originated in the artificial intelligence 
field between “symbolic” and “numerical” methods. The first approach views the 
matching problem as one of explicit search in state-space (see, e.g., [15,24,25]). 
The pioneering work of Ambler et al. [1] falls into this class. Their approach is 
based on the idea that graph matching is equivalent to the problem of finding 
maximal cliques in the so-called association graph, an auxiliary graph derived 
from the structures being matched. This framework is attractive because it casts 
the matching problem in terms of a pure graph-theoretic problem, for which a 
solid theory and powerful algorithms have been developed [7] . Since its introduc- 
tion, the association graph technique has been successfully applied to a variety 
of computer vision problems (e.g., [4,12]). 



M.A.T. Figueiredo, J. Zerubia, A.K. Jain (Eds.): EMMCVPR 2001, LNCS 2134, pp. 469-479, 2001. 
In the second approach, the relational matching problem is viewed as one of 
energy minimization. In this case, an energy (or objective) function is sought 
whose minimizers correspond to the solutions of the original problem, and a 
dynamical system, usually embedded into a parallel relaxation network, is used 
to minimize it [11,13,23,26]. Typically, these methods do not solve the problem 
exactly, but only in approximation terms. Energy minimization algorithms are 
attractive because they are amenable to parallel hardware implementation and 
also offer the advantage of biological plausibility. 

In a recent paper [17], we have developed a new framework for graph match- 
ing which does unify the two approaches just described, thereby inheriting the 
attractive features of both. The approach is centered around a remarkable result 
proved by Motzkin and Straus in the mid-1960s, and more recently expanded by 
many authors [5,10,20,6], which allows us to map the maximum clique problem 
onto the problem of extremizing a quadratic form over a linearly constrained 
domain (i.e., the standard simplex in Euclidean space). Local gradient-based 
search methods such as replicator dynamics [19] have proven to be remarkably 
effective when we restrict ourselves to simple versions of the problem, such as tree 
matching [21] or graph isomorphism [18]. However, for more difficult problems 
the challenge remains to develop powerful heuristics. 

It is a well-known fact that stationary points of quadratic programs can be 
characterized in terms of solutions of a linear complementarity problem (LCP), 
a class of inequality systems for which a rich theory and a large number of 
algorithms have been developed [9]. Hence, once the graph matching problem is 
formulated in terms of a quadratic programming problem, the use of LCP algorithms 
naturally suggests itself, and this is precisely the idea proposed in the present 
paper. Among the many LCP methods presented in the literature, pivoting pro- 
cedures are widely used and within this class Lemke’s method is certainly the 
best known. Unfortunately, like other pivoting schemes, its finite convergence 
is guaranteed only for non-degenerate problems, and ours is indeed degenerate. 
The inherent degeneracy of the problem, however, is beneficial as it leaves free- 
dom in choosing the driving variable, and we exploited this property to develop a 
variant of Lemke’s algorithm which uses a new and effective “look-ahead” pivot 
rule. The procedure depends critically on the choice of a vertex in the graph 
which identifies the driving variable in the pivoting process. Since there is no 
obvious way to determine such a vertex in an optimal manner, we resorted to 
iterating this procedure over most, if not all, vertices in the graph. The resulting 
pivoting-based heuristic has been tested on various instances of random graphs 
and the preliminary results obtained confirm the effectiveness of the proposed 
approach. 

2 Graph Matching and Linear Complementarity 

Given two graphs $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$, an isomorphism between 
them is any bijection $\phi : V_1 \to V_2$ such that $(i,j) \in E_1 \Leftrightarrow (\phi(i), \phi(j)) \in E_2$, for 
all $i,j \in V_1$. Two graphs are said to be isomorphic if there exists an isomorphism 




A Complementary Pivoting Approach to Graph Matching 471 



between them. The maximum common subgraph problem consists of finding the 
largest isomorphic subgraphs of $G_1$ and $G_2$. A simpler version of this problem is 
to find a maximal common subgraph, i.e., an isomorphism between subgraphs 
which is not included in any larger subgraph isomorphism. 

The association graph derived from $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$ is the 
undirected graph $G = (V, E)$ defined as follows: 

$$V = V_1 \times V_2$$

and 

$$E = \{ ((i,h),(j,k)) \in V \times V : i \neq j,\ h \neq k, \text{ and } (i,j) \in E_1 \Leftrightarrow (h,k) \in E_2 \}.$$

Given an arbitrary undirected graph $G = (V,E)$, a subset of vertices $C$ is called 
a clique if all its vertices are mutually adjacent, i.e., for all $i,j \in C$, with $i \neq j$, 
we have $(i,j) \in E$. A clique is said to be maximal if it is not contained in any 
larger clique, and maximum if it is the largest clique in the graph. The clique 
number, denoted by $\omega(G)$, is defined as the cardinality of the maximum clique. 

The following result establishes an equivalence between the graph matching 
problem and the maximum clique problem (see, e.g., [2]). 

Theorem 1. Let $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$ be two graphs, and let $G$ be 
the corresponding association graph. Then, all maximal (maximum) cliques in $G$ 
are in one-to-one correspondence with maximal (maximum) common subgraph 
isomorphisms between $G_1$ and $G_2$. 
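
The association graph construction can be illustrated with a small sketch. The following Python code is only an illustration under stated assumptions: vertices are integers starting at 0, graphs are undirected edge lists, and the helper names `association_graph` and `is_clique` are hypothetical, not taken from the paper.

```python
from itertools import combinations

def association_graph(n1, edges1, n2, edges2):
    """Build the association graph of two undirected graphs.

    Vertices are pairs (i, h) with i in G1 and h in G2. Two pairs (i, h)
    and (j, k) are adjacent when i != j, h != k, and the edge (i, j) is
    present in G1 exactly when the edge (h, k) is present in G2.
    """
    e1 = {frozenset(e) for e in edges1}
    e2 = {frozenset(e) for e in edges2}
    vertices = [(i, h) for i in range(n1) for h in range(n2)]
    edges = set()
    for (i, h), (j, k) in combinations(vertices, 2):
        same_status = (frozenset((i, j)) in e1) == (frozenset((h, k)) in e2)
        if i != j and h != k and same_status:
            edges.add(frozenset(((i, h), (j, k))))
    return vertices, edges

def is_clique(subset, edges):
    """True when every pair of vertices in subset is adjacent."""
    return all(frozenset((u, v)) in edges for u, v in combinations(subset, 2))

# Toy example: a triangle matched against a path on three vertices.
triangle = [(0, 1), (1, 2), (0, 2)]
path = [(0, 1), (1, 2)]
V, E = association_graph(3, triangle, 3, path)
```

In this toy case the pair of assignments 0→0, 1→1 forms a clique, while adding 2→2 does not, because (0,2) is an edge of the triangle but not of the path.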

Now, let $G = (V,E)$ be an arbitrary graph of order $n$, and let $S_n$ denote the 
standard simplex of $\mathbb{R}^n$: 

$$S_n = \{ x \in \mathbb{R}^n : e^T x = 1 \text{ and } x_i \geq 0,\ i = 1 \ldots n \}$$

where $e$ is the vector whose components equal 1, and a “T” denotes transposition. 
Given a subset of vertices $C$ of $G$, we will denote by $x^C$ its characteristic vector, 
which is the point in $S_n$ defined as 

$$x^C_i = \begin{cases} 1/|C|, & \text{if } i \in C \\ 0, & \text{otherwise} \end{cases}$$

where $|C|$ denotes the cardinality of $C$. 

Consider the following quadratic program 

$$\min f_G(x) = x^T A_G x \quad \text{s.t. } x \in S_n \tag{1}$$

where $A_G = (a^G_{ij})$ is the $n \times n$ symmetric matrix defined as 

$$a^G_{ij} = \begin{cases} 1/2, & \text{if } i = j \\ 0, & \text{if } i \neq j \text{ and } (i,j) \in E \\ 1, & \text{if } i \neq j \text{ and } (i,j) \notin E \end{cases} \tag{2}$$
The following theorem, recently proved by Bomze [5], expands on the 
Motzkin-Straus theorem [16], a remarkable result which establishes a connec- 
tion between the maximum clique problem and certain standard quadratic pro- 
grams. This has an intriguing computational significance in that it allows us to 
shift from the discrete to the continuous domain in an elegant manner. 

Theorem 2. Let $C$ be a subset of vertices of a graph $G$, and let $x^C$ be its 
characteristic vector. Then, $C$ is a maximal (maximum) clique of $G$ if and only if 
$x^C$ is a local (global) solution of program (1). Moreover, all local (and hence global) 
solutions of (1) are strict. 

Unlike the original Motzkin-Straus formulation, which is plagued by the pres- 
ence of “spurious” solutions [20], the previous result guarantees us that all so- 
lutions of (1) are strict, and are characteristic vectors of maximal/maximum 
cliques in the graph. In a formal sense, therefore, a one-to-one correspondence 
exists between maximal cliques and local minimizers of $f_G$ in $S_n$ on the one 
hand, and maximum cliques and global minimizers on the other hand. 
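
As a small numerical illustration of this correspondence, the sketch below evaluates the objective of program (1) at characteristic vectors. It assumes the diagonal entries of the matrix in (2) equal 1/2 (the regularized form due to Bomze); the function names and the toy graph are hypothetical. Under that assumption, a clique $C$ gives the value $1/(2|C|)$, so larger cliques give smaller objective values.

```python
from fractions import Fraction

def build_A(n, edges):
    """Matrix of program (1): 1/2 on the diagonal (assumed), 0 for edges, 1 for non-edges."""
    e = {frozenset(x) for x in edges}
    A = [[Fraction(1)] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                A[i][j] = Fraction(1, 2)
            elif frozenset((i, j)) in e:
                A[i][j] = Fraction(0)
    return A

def characteristic_vector(n, C):
    """Point of the simplex with mass 1/|C| on each vertex of C."""
    return [Fraction(1, len(C)) if i in C else Fraction(0) for i in range(n)]

def f(A, x):
    """Quadratic form x^T A x."""
    n = len(x)
    return sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))

# Toy graph: triangle {0, 1, 2} plus a pendant edge (2, 3).
A = build_A(4, [(0, 1), (1, 2), (0, 2), (2, 3)])
f_triangle = f(A, characteristic_vector(4, {0, 1, 2}))  # maximum clique
f_edge = f(A, characteristic_vector(4, {2, 3}))         # maximal, not maximum
f_nonclique = f(A, characteristic_vector(4, {0, 3}))    # not a clique
```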

The following result, which is a straightforward consequence of Theorems 1 
and 2, establishes an elegant connection between graph matching and quadratic 
programming. 

Theorem 3. Let $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$ be two graphs, and let $G$ 
be the corresponding association graph. Then, all local (global) solutions to (1) 
are in one-to-one correspondence with maximal (maximum) common subgraph 
isomorphisms between $G_1$ and $G_2$. 

Computing the stationary points of (1) can be done by solving the LCP 
$(q_G, M_G)$, which is the problem of finding a vector $x$ satisfying the system 

$$y = q_G + M_G x \geq 0, \quad x \geq 0, \quad x^T y = 0$$

where 

$$q_G = \begin{bmatrix} 0 \\ \vdots \\ 0 \\ -1 \\ 1 \end{bmatrix}, \qquad M_G = \begin{bmatrix} A_G & -e & e \\ e^T & 0 & 0 \\ -e^T & 0 & 0 \end{bmatrix}. \tag{3}$$

With the above definitions, it is well known that if $z$ is a complementary solution 
of $(q_G, M_G)$ with $z^T = (x^T, y^T)$ and $x \in \mathbb{R}^n$, then $x$ is a stationary point of 
(1). Indeed, the matrix $A_G$ is always strictly copositive, hence so is $M_G$, and 
that is enough to assure that $(q_G, M_G)$ always has a solution [9]¹. Of course, a 
stationary point of (1) is not necessarily a local minimum, but in practice this is 
not a problem since there are several techniques that, starting from a stationary 
point can reach a nearby local optimum. An example is given by the replicator 
dynamics [19], but see [14] for a complete discussion on this topic. 
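
To make the link between program (1) and the LCP concrete, the following sketch builds the LCP data for a toy graph and checks that the characteristic vector of a maximal clique, extended with suitable multiplier entries, satisfies the three LCP conditions. The block layout of the matrix (the constraint rows $e^T$ and $-e^T$ below the matrix of (2), with diagonal 1/2) is an assumption read off the tableaus of Section 3, and all function names are hypothetical.

```python
from fractions import Fraction

def build_A(n, edges):
    """Matrix of program (1): 1/2 on the diagonal (assumed), 0 for edges, 1 otherwise."""
    e = {frozenset(x) for x in edges}
    return [[Fraction(1, 2) if i == j
             else Fraction(0) if frozenset((i, j)) in e
             else Fraction(1)
             for j in range(n)] for i in range(n)]

def lcp_data(A):
    """Return (q, M) with M = [[A, -e, e], [e^T, 0, 0], [-e^T, 0, 0]] (assumed layout)."""
    n = len(A)
    q = [Fraction(0)] * n + [Fraction(-1), Fraction(1)]
    M = [list(A[i]) + [Fraction(-1), Fraction(1)] for i in range(n)]
    M.append([Fraction(1)] * n + [Fraction(0), Fraction(0)])
    M.append([Fraction(-1)] * n + [Fraction(0), Fraction(0)])
    return q, M

def is_complementary_solution(q, M, z):
    """Check y = q + M z >= 0, z >= 0 and z^T y = 0."""
    y = [q[i] + sum(M[i][j] * z[j] for j in range(len(z))) for i in range(len(q))]
    nonneg = all(v >= 0 for v in y) and all(v >= 0 for v in z)
    return nonneg and sum(a * b for a, b in zip(z, y)) == 0

# Toy graph: triangle {0, 1, 2} plus a pendant edge (2, 3).
q, M = lcp_data(build_A(4, [(0, 1), (1, 2), (0, 2), (2, 3)]))
third = Fraction(1, 3)
# Characteristic vector of the triangle, followed by the two extra variables.
z_clique = [third, third, third, Fraction(0), Fraction(1, 6), Fraction(0)]
# The barycenter of the simplex with zero multipliers is not a solution.
z_uniform = [Fraction(1, 4)] * 4 + [Fraction(0), Fraction(0)]
```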



¹ Recall that, given a cone $\Gamma \subseteq \mathbb{R}^n$, a symmetric matrix $Q$ is said to be $\Gamma$-copositive 
if $x^T Q x \geq 0$ for all $x \in \Gamma$. If the inequality holds strictly for all $x \in \Gamma \setminus \{0\}$, then $Q$ 
is said to be strictly $\Gamma$-copositive. 
3 A Pivoting-Based Heuristic for Graph Matching 

The technical literature supplies a large number of algorithms for solving an 
LCP [9]. The most popular is probably Lemke’s method, largely for its ability 
to provide a solution for a large number of matrix classes. Lemke’s “Scheme I” 
belongs to the family of pivoting algorithms. Given the generic LCP $(q,M)$, it 
deals with the augmented problem $(q,d,M)$ defined by 

$$y = q + [M, d] \begin{bmatrix} x \\ \theta \end{bmatrix} \geq 0, \quad \theta \geq 0, \quad x \geq 0, \quad x^T y = 0. \tag{4}$$

A solution of $(q,d,M)$ with $\theta = 0$ promptly yields a solution to $(q,M)$, and 
Lemke’s method intends to compute precisely such a solution. We refer to [9] 
for a detailed description of Lemke’s algorithm. In our implementation, we chose 
$d = e$, as our problem does not expose peculiarities that would justify a deviation 
from this common practice. 

As is usual when outlining pivoting algorithms, we will use a superscript for 
the problem data. In practice, $q^\nu$ and $M^\nu$ will identify the situation after $\nu$ pivots 
and $A_G^\nu$ will indicate the $n \times n$ leading principal submatrix of $M^\nu$. Consistently, 
$y^\nu$ and $x^\nu$ will indicate the vectors of basic and non-basic variables, respectively, 
each made up of a combination of the original $x_i$ and $y_i$ variables. The notation 
$\langle y_i^\nu, x_j^\nu \rangle$ is used to indicate pivoting transformations. The index set of the 
basic variables that satisfy the min-ratio test at iteration $\nu$ will be denoted by 
$\Omega^\nu$, i.e. 

$$\Omega^\nu = \arg\min_i \left\{ \frac{-q_i^\nu}{m_{is}^\nu} : m_{is}^\nu < 0 \right\}$$

where $s$ is the index of the driving column. Also, in the sequel the auxiliary 
column that contains the covering vector $d$ in (4) will be referred to as 
column $n + 3$ of matrix $M = M_G$. 
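
The min-ratio test can be sketched in a few lines. This is a minimal illustration, assuming the tableau stores the constant column and the driving column as plain lists; the function name `min_ratio_set` is hypothetical. More than one returned index signals a degenerate step, which is exactly the situation the degeneracy resolution rule has to handle.

```python
from fractions import Fraction

def min_ratio_set(q, column):
    """Return the set of row indices attaining min(-q[i] / column[i]) over column[i] < 0.

    q      : constant column of the current tableau
    column : entries of the driving column, one per basic variable
    An empty set means that no basic variable blocks the driving variable.
    """
    ratios = {i: Fraction(-q[i]) / Fraction(column[i])
              for i in range(len(q)) if column[i] < 0}
    if not ratios:
        return set()
    best = min(ratios.values())
    return {i for i, r in ratios.items() if r == best}

# Degenerate case: three rows tie, as happens after the first pivot for n = 3.
tied = min_ratio_set([1, 1, 1, 1, 2], [-1, -1, -1, 0, 0])
# Non-degenerate case: a single blocking row.
single = min_ratio_set([2, 1, 3], [-1, -1, -2])
```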

In general, assuming that an LCP is non-degenerate is a strategy commonly adopted 
to prove finiteness of pivoting schemes. This assumption amounts to having 
$|\Omega^\nu| = 1$ for all $\nu$, thereby excluding any cycling behavior. In particular, Lemke’s 
method is guaranteed to process any non-degenerate problem $(q,M)$ where $M$ 
is strictly $\mathbb{R}^n_+$-copositive, and to do so without terminating on a secondary ray 
[9]. Unfortunately our LCP $(q_G, M_G)$ is degenerate and standard degeneracy 
resolution strategies have proven to yield unsatisfactory results [14]. 

The proposed degeneracy resolution technique makes use of the so-called 
least-index rule, which amounts to blocking the driving variable with a basic one 
that has minimum index within a certain subset of $\Omega^\nu$, i.e. $r = \min \Phi^\nu$ for some 
$\Phi^\nu \subseteq \Omega^\nu$. The least-index rule per se does not guarantee convergence. In fact, 
we can ensure termination by choosing the blocking variable only among those 
that make the number of degenerate variables decrease as slowly as possible, i.e. 
among the index-set 

$$\Phi^\nu = \arg\min_i \{ |\Omega^\nu| - |\Omega_i^{\nu+1}| > 0 : i \in \Omega^\nu \} \subseteq \Omega^\nu$$

where $\Omega_i^{\nu+1}$ is the index-set of those variables that would satisfy the min-ratio 
test at iteration $\nu + 1$ if the driving variable at iteration $\nu$ was blocked with 
$y_i^\nu$, for $i \in \Omega^\nu$. The previous conditional implies that a pivot step is taken and then 
reset in a sort of “look-ahead” fashion, hence we will refer to this rule as the 
look-ahead (pivot) rule. 
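
The selection step of the look-ahead rule can be sketched as follows. This is an illustration only: the callback `next_tie_size(i)` is a hypothetical stand-in for the trial pivot (take the pivot with row i blocking, run the min-ratio test, undo the pivot), and the fallback used when no candidate gives a strictly positive decrease is an assumption that the text does not specify.

```python
def look_ahead_choice(omega, next_tie_size):
    """Pick the blocking row among the tied rows in omega.

    omega         : set of row indices tied in the current min-ratio test
    next_tie_size : function i -> size of the tie set at the next iteration
                    if row i were chosen as blocking row (trial pivot)
    """
    if len(omega) == 1:
        return min(omega)
    # Decrease in the number of tied (degenerate) rows for each candidate.
    drops = {i: len(omega) - next_tie_size(i) for i in omega}
    positive = {i: d for i, d in drops.items() if d > 0}
    if not positive:
        # Assumption: fall back to the plain least-index rule.
        return min(omega)
    smallest = min(positive.values())
    phi = {i for i, d in positive.items() if d == smallest}
    return min(phi)  # least-index rule restricted to phi

# Toy example: rows 5 and 7 reduce the tie set most slowly; row 5 has least index.
sizes = {2: 1, 5: 2, 7: 2}
chosen = look_ahead_choice({2, 5, 7}, lambda i: sizes[i])
```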

Before actually proceeding to illustrate a version of Lemke’s algorithm ap- 
plied to our matching problem, let us take a look at the tableaus that it generates. 
This will help us to identify regularities that are reflected in the behavior of the 
algorithm itself. The initial tableau follows: 





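$$\begin{array}{c|c|ccc|c|c|c}
 & 1 & x_1 & \cdots & x_n & x_{n+1} & x_{n+2} & \theta \\ \hline
y_1 & 0 & & & & -1 & 1 & 1 \\
\vdots & \vdots & & A_G & & \vdots & \vdots & \vdots \\
y_n & 0 & & & & -1 & 1 & 1 \\ \hline
y_{n+1} & -1 & 1 & \cdots & 1 & 0 & 0 & 1 \\
y_{n+2} & 1 & -1 & \cdots & -1 & 0 & 0 & 1
\end{array}$$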



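As $q_{n+1}$ is the only negative entry in the column of $q$, the first pivot to occur 
during initialization is $\langle y_{n+1}, \theta \rangle$, thereby producing the following transformation: 

$$\begin{array}{c|c|ccc|c|c|c}
 & 1 & x_1 & \cdots & x_n & x_{n+1} & x_{n+2} & y_{n+1} \\ \hline
y_1 & 1 & & & & -1 & 1 & 1 \\
\vdots & \vdots & & A_G - ee^T & & \vdots & \vdots & \vdots \\
y_n & 1 & & & & -1 & 1 & 1 \\ \hline
\theta & 1 & -1 & \cdots & -1 & 0 & 0 & 1 \\
y_{n+2} & 2 & -2 & \cdots & -2 & 0 & 0 & 1
\end{array}$$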
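
The effect of the initialization pivot can be checked numerically. The sketch below is an illustration under stated assumptions: the tableau is stored row by row in the form "basic variable = constant + sum of coefficient times non-basic variable", the matrix of (2) has 1/2 on its diagonal, and the function name `pivot` and the three-vertex path graph are hypothetical choices.

```python
from fractions import Fraction as F

def pivot(T, r, s):
    """Exchange the basic variable of row r with the non-basic variable of column s.

    T[i][0] is the constant term of row i; T[i][j] (j >= 1) multiplies the
    j-th non-basic variable. Returns a new tableau in the same layout.
    """
    p = T[r][s]
    new_r = [-v / p for v in T[r]]   # solve row r for the entering variable
    new_r[s] = 1 / p                 # coefficient of the leaving variable
    out = []
    for i, row in enumerate(T):
        if i == r:
            out.append(new_r)
            continue
        new_row = [row[j] + row[s] * new_r[j] for j in range(len(row))]
        new_row[s] = row[s] * new_r[s]
        out.append(new_row)
    return out

# Path graph 0 - 1 - 2 (n = 3): edges (0, 1) and (1, 2), non-edge (0, 2).
A = [[F(1, 2), F(0), F(1)],
     [F(0), F(1, 2), F(0)],
     [F(1), F(0), F(1, 2)]]
n = 3
# Columns: constant, x_1..x_n, x_{n+1}, x_{n+2}, theta.
T0 = [[F(0)] + A[i] + [F(-1), F(1), F(1)] for i in range(n)]
T0.append([F(-1)] + [F(1)] * n + [F(0), F(0), F(1)])   # row y_{n+1}
T0.append([F(1)] + [F(-1)] * n + [F(0), F(0), F(1)])   # row y_{n+2}

# First pivot: y_{n+1} (row index n) leaves, theta (column n + 3) enters.
T1 = pivot(T0, n, n + 3)
```

After this pivot the constant column becomes (1, ..., 1, 1, 2) and the leading block becomes the original matrix minus the all-ones matrix, in agreement with the second tableau.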



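The driving variable for the second pivot is $x_{n+1}$. Since $m_{i,n+1} = -1$ for all 
$i = 1, \ldots, n$ it is immediate to see that the relative blocking variable can be any 
one of $y_1, \ldots, y_n$. In this case we apply no degeneracy resolution criterion, but 
rather allow for user intervention by catering for the possibility of deciding the 
second driving variable a priori. Let thus $y_p$ be the (arbitrary) variable that shall 
block $x_{n+1}$. After performing $\langle y_p, x_{n+1} \rangle$, we have the following tableau: 

$$\begin{array}{c|c|ccc|c|c|c}
 & 1 & x_1 & \cdots & x_n & y_p & x_{n+2} & y_{n+1} \\ \hline
y_1 & 0 & & & & 1 & 0 & 0 \\
\vdots & \vdots & & & & \vdots & \vdots & \vdots \\
y_{p-1} & 0 & & & & 1 & 0 & 0 \\
x_{n+1} & 1 & & A_{G,p} & & -1 & 1 & 1 \\
y_{p+1} & 0 & & & & 1 & 0 & 0 \\
\vdots & \vdots & & & & \vdots & \vdots & \vdots \\
y_n & 0 & & & & 1 & 0 & 0 \\ \hline
\theta & 1 & -1 & \cdots & -1 & 0 & 0 & 1 \\
y_{n+2} & 2 & -2 & \cdots & -2 & 0 & 0 & 1
\end{array}$$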
Algorithm 3.1 Lemke’s “Scheme I” with the look-ahead rule. 

Input: A graph $G = (V, E)$ and $p \in V$. 

Let $(q_G, e, M_G)$ be the augmented LCP, where $q_G$ and $M_G$ are defined in (3). 

$\nu \leftarrow 0$, $\langle y^\nu_{n+1}, \theta \rangle$, $\nu \leftarrow \nu + 1$, $\langle y^\nu_p, x^\nu_{n+1} \rangle$ 

The driving variable is $x_p$. 

Infinite loop 

    $\nu \leftarrow \nu + 1$ 

    Let $x_s$ denote the driving variable. 

    $\Omega^\nu = \arg\min_i \{ -q_i^\nu / m_{is}^\nu : m_{is}^\nu < 0 \}$ 

    If $|\Omega^\nu| = 1$ then $r = \min \Omega^\nu$ 

    else $\Phi^\nu = \arg\min_i \{ |\Omega^\nu| - |\Omega_i^{\nu+1}| > 0 : i \in \Omega^\nu \}$, $r = \min \Phi^\nu$. 

    $\langle y_r^\nu, x_s^\nu \rangle$ 

    If $y_r^\nu = \theta$ then 

        Let $x$ denote the complementary solution of $(q_G, M_G)$. 
        The result is $\mathrm{supp}(x) \cap V$ 

    The new driving variable is the complementary of $y_r^\nu$. 


Algorithm 3.2 The pivoting-based heuristic (PBH) for graph matching. 

Input: Two graphs $G_1$ and $G_2$. 

Construct the association graph $G = (V,E)$ of $G_1$ and $G_2$. 

Let $G' = (V', E')$ be a permutation of $G$ 
with $\deg(u') \geq \deg(v')$ for all $u', v' \in V'$ with $u' < v'$. 

$K^* \leftarrow \emptyset$ 

For $v' = 1, \ldots, n$ : $\deg(v') \geq |K^*|$ do 

    Run Algorithm 3.1 with $G'$ and $v'$ as input. 

    Let $K$ be the obtained result. 

    If $|K| > |K^*|$, then $K^* \leftarrow K$. 

The result is the mapping of $K^*$ in $G$. 



where $A_{G,p}$ denotes the matrix whose rows are defined as 

$$(A_{G,p})_i = \begin{cases} (A_G)_p - e^T, & \text{if } i = p \\ (A_G)_i - (A_G)_p, & \text{otherwise.} \end{cases}$$

Algorithm 3.1 formalizes the previous statements. 

Empirical evidence indicated $p$ as a key parameter for the quality of the final 
result of Algorithm 3.1. Unfortunately we could not identify any effective means 
to restrict the choice of values in $V$ that can guarantee a good sub-optimal 
solution. We thus had to consider iterating over most, if not all, vertices of $V$ 
as outlined in Algorithm 3.2. There we employ a very simple criterion to avoid 
considering those nodes that cannot lead to larger cliques than the one we 
already have because their degree is too small. Clearly such a criterion is 
effective only for very sparse graphs. 
We also observed that the scheme is sensitive to the ordering of nodes and 
found that the best figures were obtained by reordering $G$ by decreasing node 
degrees. This feature too is formalized in Algorithm 3.2. We will refer to that 
scheme by the name Pivoting Based Heuristic (PBH). 
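
The outer loop of PBH (degree ordering, degree-based pruning, keeping the largest clique found) can be sketched as below. Note that `greedy_clique_from` is a deliberately simple greedy stand-in and NOT the pivoting procedure of Algorithm 3.1; it is there only so that the loop can be run. The pruning test (skip vertices whose degree is smaller than the size of the best clique so far) is an assumed reading of the loop condition, and all names are hypothetical.

```python
def pbh_outer_loop(n, edges, find_clique_from):
    """Iterate a clique-finding routine over vertices in decreasing degree order."""
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    order = sorted(range(n), key=lambda v: -len(adj[v]))  # decreasing degree
    best = set()
    for v in order:
        # A vertex of degree d lies in cliques of size at most d + 1, so it can
        # only improve on `best` if d >= |best|. Later vertices have no larger
        # degree, hence the loop can stop here.
        if len(adj[v]) < len(best):
            break
        K = find_clique_from(v, adj)
        if len(K) > len(best):
            best = K
    return best

def greedy_clique_from(v, adj):
    """Stand-in for Algorithm 3.1 (not Lemke's method): grow a clique greedily from v."""
    clique = {v}
    candidates = set(adj[v])
    while candidates:
        # Prefer the candidate with most neighbours among the remaining candidates.
        u = max(candidates, key=lambda w: (len(adj[w] & candidates), -w))
        clique.add(u)
        candidates &= adj[u]
    return clique

# Toy graph: triangle {0, 1, 2} with a tail 2 - 3 - 4.
result = pbh_outer_loop(5, [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4)], greedy_clique_from)
```

Passing the clique finder as a parameter keeps the loop independent of how each clique is obtained, so the pivoting routine could be swapped in without changing the pruning logic.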

4 Experimental Results 

In this section we present some experimental results of applying PBH to the 
problem of matching pairs of random graphs. Random structures represent a 
useful benchmark not only because they are not constrained to any particular 
application, but also because it is simple to replicate experiments and hence to 
make comparisons with other algorithms. 

We generated random 50-node graphs using edge probabilities (i.e. densities) 
ranging from 0.1 to 0.95. For each density value, 20 graphs were constructed 
so that, overall, 200 graphs were used in the experiments. Each graph had its 
vertices permuted and was then subject to a corruption process which consisted 
of randomly deleting a fraction of its nodes. In so doing we in fact obtain a graph 
isomorphic to a proper subgraph of the original one. Various levels of corruption 
(i.e., percentage of node deletion) were used, namely 0% (the pure isomorphism 
case), 10%, 20% and 30%. In other words, the order of the corrupted graphs 
ranged from 50 to 35. 
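
The data generation procedure described above can be sketched as follows. This is an illustration under stated assumptions: the random number generator, the seed, and the function names are hypothetical, and the original experiments may have used a different generator or deletion scheme.

```python
import random

def random_graph(n, density, rng):
    """Undirected random graph: each pair (i, j), i < j, is an edge with probability density."""
    return {(i, j) for i in range(n) for j in range(i + 1, n) if rng.random() < density}

def permute_and_corrupt(n, edges, percent_deleted, rng):
    """Permute the vertex labels and delete a given percentage of the vertices.

    Returns the order of the corrupted graph, its edge set, and the map from
    surviving original vertices to their new labels (the ground-truth matching).
    """
    perm = list(range(n))
    rng.shuffle(perm)                                   # random relabelling
    n_keep = n - n * percent_deleted // 100
    keep = set(rng.sample(range(n), n_keep))            # surviving vertices
    survivors = sorted(keep, key=lambda v: perm[v])     # order induced by perm
    relabel = {old: new for new, old in enumerate(survivors)}
    new_edges = {tuple(sorted((relabel[i], relabel[j])))
                 for i, j in edges if i in keep and j in keep}
    return n_keep, new_edges, relabel

rng = random.Random(1)
original = random_graph(50, 0.5, rng)
orders = [permute_and_corrupt(50, original, pct, rng)[0] for pct in (0, 10, 20, 30)]
```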

PBH was applied on each pair of graphs thus constructed and, after con- 
vergence, the percentage of matched nodes was recorded. Replicator dynamics, 
a class of dynamical systems developed in evolutionary game theory and other 
branches of mathematical biology, have recently proven remarkably powerful on 
simple versions of the graph matching problem, despite their inherent inability 
to escape from local optima [17,18,19]. For the sake of comparison, we therefore 
tested their effectiveness on our (more difficult) subgraph isomorphism task. 

Figure 1 plots the results obtained with both PBH and replicator dynamics. 
As can be seen, whenever no corruption was applied on the original graphs (i.e. in 
the case of isomorphic graphs), both methods systematically found a maximum 
isomorphism, i.e. a maximum clique in the association graph (as far as replicator 
equations are concerned, this is indeed not surprising, as shown in [18]). The 
emerging picture did not change significantly for PBH when we did delete some 
nodes, whereas the replicator equations underwent a notable deterioration of 
performance. For the latter case, in fact, the curves (b)-(d) of Figure 1 have a 
peculiar “w” shape with a performance peak on 0.5-density graphs, where the 
corresponding association graphs have minimum density. 

It is also possible to compare our approach with the well-known Graduated 
Assignment method (GA) of Gold and Rangarajan [11]. Their algorithm is based 
on the minimization of an objective function which is significantly different from 
ours. In [11] (Figure 8) they present results of applying GA on 100-node random 
graphs with density values ranging from 4% to 28%, and various corruption 
levels up to 30%. From their results a significant sensitivity of GA to node 
deletion emerges, similarly to what happens for the replicator dynamics, but 
[Figure 1, panels (a) to (d): percentage of correct matches versus density, for 
replicator equations (+) and PBH (×). Legible panel titles: $|G_2| = 0.9|G_1| = 45$, 
$\omega(G) = 45$; $|G_2| = 0.8|G_1| = 40$, $\omega(G) = 40$; $|G_2| = 0.7|G_1| = 35$, $\omega(G) = 35$.] 

Fig. 1. Results of matching 50-node random graphs, with varying levels of corruption, 
using PBH and replicator dynamics. The x-axis represents the (approximate) density 
of the matched graphs, while the y-axis represents the percentage of correct matches. 
Here $\omega(G)$ is the size of the maximum clique of the association graph, i.e. the size of 
the maximum isomorphism, and $|C|$ is the size of the isomorphism returned by the 
algorithms, i.e., the size of the maximal clique found. Figures (a) to (d) correspond 
to different levels of corruption, i.e. 0%, 10%, 20% and 30%, respectively. All curves 
represent averages over 20 trials. 



in a less pronounced manner. In contrast, the performance of PBH (on smaller 
graphs) seems to be more insensitive to the corruption level, a feature which 
is clearly desirable. Notice, however, that the results tend to degrade slowly 
for the denser association graphs that arise for densities close to 0 and 1. This 
phenomenon tends to strengthen up slowly as more nodes are deleted, but the 
average efficiency never goes below 85%. This figure is superior to those obtained 
with replicator dynamics and GA. 

A remarkable empirical finding was that Algorithm 3.1 never failed to return 
a maximal clique. We then never needed to perturb the final point in order to 
reach a nearby local minimizer. We tried to find exceptions by running PBH on 
random graphs with non-clique regular subgraphs. The latter subgraphs corre- 
spond in fact to stationary points of program (1) as shown in [5]. Hundreds of 
experiments were conducted on random instances with different degrees of noise, 
but PBH never failed to return a maximal clique. At the moment we cannot give 
a formal proof of this fact. 

5 Conclusions 

Motivated by a recent quadratic formulation, we have presented a pivoting-based 
heuristic for the graph matching problem based on the corresponding linear com- 
plementarity problem. The preliminary results obtained are very encouraging 
and indicate the proposed framework as a new promising way to tackle graph 
matching and related combinatorial problems. Note also that our algorithm is 
completely devoid of working parameters, a valuable feature which distinguishes 
it from other heuristics proposed in the literature. 

Clearly, more experimental work needs to be done in order to fully assess 
the potential of the method. Also, generalizations of the proposed approach 
to attributed graph and error-tolerant (many-to-many) matching problems are 
possible, along the lines suggested in [17,3,22]. All this will be the subject of 
future work. 
Abstract. In this paper, a new method for reconstructing 3-D shapes is 
proposed. It is based on an active stereo vision system composed of a camera 
and a light system which projects a set of structured laser rays on the scene to 
be analyzed. The depth information is provided by matching the laser rays and 
the corresponding spots appearing in the image. The matching task is performed 
by using Genetic Algorithms (GAs). The process converges towards the 
optimum solution which proves that GAs can effectively be used for this 
problem. An efficient 3-D reconstruction method is introduced. The 
experimental results demonstrate that the proposed approach is stable and 
provides high accuracy 3-D object reconstruction. 



1 Introduction 

Reconstructing 3-D shapes is a very important issue for many applications in 
computer vision and computer graphics, namely object recognition for robotic vision, 
virtual environment construction, and so on. In order to obtain the three-dimensional 
surface information of an object, a passive or active stereo vision method is usually 
used. The most commonly employed passive method consists in taking two images of 
a scene at two different shooting angles by using either two cameras or only one 
camera from two different positions. Then, two-dimensional coordinates in the two 
images are determined. Supposing that both, the geometrical relationship between the 
two cameras (or the displacement of the unique camera) and their intrinsic parameters 
are known, 3-D coordinates of a point can be deduced from the two-dimensional 
coordinates by triangulation. Although this method is accurate and often used, it 
requires high-performance image processing tasks such as extracting the points to be 
reconstructed from one of the two images and searching along the epipolar line for 
their corresponding points in the second image [1]. This process is defined as stereo 
image matching and remains one of the bottlenecks in computer vision [2] [3] [4] [5]. 

Active stereovision systems offer an alternative approach to the use of two 
cameras. They consist in replacing one of the two cameras by a light system (a laser 
emitter) which projects either one light ray or a set of structured light rays onto the 
scene. In the first case, a sequence of images is taken by the camera as the light ray 
scans the scene whereas in the second case, only one image is taken. In this study, we 
are concerned with the second case. An image thus obtained has many spots created 
by the laser rays. Supposing that both the geometrical relationship between the 
camera and the light projection system, and the intrinsic parameters of the system are 
known, 3-D coordinates of a spot in the image can be determined and provide the 
corresponding depth information. This requires identifying the laser ray from which 
the 2-D spot originates. In this case, the stereo matching problem boils down to 
matching the laser rays and their corresponding spots in the image reference. It is a 
hard combinatorial search problem which is difficult to solve with a conventional 
optimization method. We propose to tackle this problem by using Genetic Algorithms 
(GAs). 

The remainder of this paper is organized as follows: In the next section, we briefly 
present our active stereovision system, then we describe the calibration of the system 
and the 3-D reconstruction process. In section 3, we present the principal elements of 
the proposed GAs for matching laser rays and points. In section 4, we describe the 
partitioning process. Data are distributed into regions and GAs are independently 
applied in each cluster. In section 5, we show experimental results and in section 6 the 
conclusion. 

2 The Active Stereovision System 

2.1 System Description and Calibration 

Our active stereovision system is composed of the following elements: A CCD 
camera connected to a PC which allows image acquisitions, and a laser diode 
connected to a diffraction grid for the generation of a laser array composed of 361 
(19x19) rays directed so that the angle between two consecutive rays is equal to 
0.77°. Each ray is indexed by two indices m and n which vary from -9 to 9. In order 
to reconstruct the 3-D surface of an object, one only needs to take an image of the 
object illuminated by the laser array. The intersection of each ray with the object 
produces a spot on the object. Thus, the camera provides an image in which many 
points are lit (see Fig. 1). 




[Figure 1 shows the laser projector, the CCD camera, and the image obtained.] 

Fig. 1. An outlook of the active stereo vision system. 
The 3-D coordinates of points on the illuminated scene can be determined only if the 
system is calibrated. Instead of using a classical calibration method that requires the 
determination of the matrices which characterize the intrinsic and extrinsic parameters 
of the system [1], we use a simpler methodology. It consists in taking a sequence of 
images of an illuminated plane which moves between two fixed positions. During the 
translation the plane must stay perpendicular to the main ray (m = n = 0) (see Fig. 2). 
Knowing the characteristics of the acquisition system, the position of the plane 
for each image of the sequence, and the spot coordinates in each image, a set of 
calibration parameters can be deduced [6]. Thus, for each ray $(m, n)$, $-9 \leq m \leq 9$, $-9 \leq n \leq 9$, of 
the beam we will have: 

1) the parameters $A_{mn}$ and $B_{mn}$ of the straight line $\Delta_{mn}$ which is the projection of 
the ray (m, n) on the image reference; 

2) the parameters $C_{mn}$, $D_{mn}$ and […] which model the hyperbolic curve $F_{mn}$. It 
describes the depth of the successive spots created by the ray (m, n) along the 
image sequence. $F_{mn}$ is a function of the spot projections on $\Delta_{mn}$. 

The calibration is performed only once, before acquiring images of the objects to 
be analyzed. The parameters then provided will be used for 3-D reconstruction. 




Fig. 2. The calibration process. 



2.2 3-D Reconstruction 

For each spot visible on an image, the 2-D coordinates of its center are measured. The 
corresponding three-dimensional coordinates of the spot can be calculated as follows 
[7]: Let $p_k$ be the center of a given spot. If $p_k$ is matched to the straight line $\Delta_{mn}$, we 
calculate $p_k'$, the orthogonal projection of $p_k$ on the straight line $\Delta_{mn}$. Substituting the 
coordinates of $p_k'$ in the function $F_{mn}$ gives the depth $Z_k$ of the 3-D point $P_k$ whose 
projection in the image is $p_k$. The 3-D coordinates $(X_k, Y_k, Z_k)$ of $P_k$ are expressed in 
the absolute referential (OXYZ) linked to the illuminated scene (the Z-axis 
corresponds to the main laser ray (m = n = 0)). If $\varphi$ is the inter-ray angle, the indices 
of the straight lines can be chosen so that the corresponding ray is at an angle 
$m\varphi$ around Y and at an angle $n\varphi$ around X (see Fig. 3). 

From these hypotheses, we have: 

$$X_k = (L_0 - Z_k)\tan(m\varphi) \qquad (1a)$$

$$Y_k = (L_0 - Z_k)\tan(n\varphi) \qquad (1b)$$

where $L_0$ is the distance between the plane (XOY) and the laser emitter. 
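
The reconstruction step can be illustrated with a minimal sketch. It assumes that the image line of a ray is written y = a*x + b (reading the two line parameters as slope and intercept is an assumption), and it takes the depth as an input instead of modelling the hyperbolic depth curve, whose exact form is not given here. The function names are hypothetical.

```python
import math

def project_onto_line(px, py, a, b):
    """Orthogonal projection of the image point (px, py) onto the line y = a*x + b."""
    x = (px + a * (py - b)) / (1.0 + a * a)
    return x, a * x + b

def spot_to_xyz(z, m, n, phi, l0):
    """Eqs. (1a)/(1b): X and Y of the 3-D point on ray (m, n) at depth z.

    phi is the inter-ray angle and l0 the distance between the plane (XOY)
    and the laser emitter.
    """
    x = (l0 - z) * math.tan(m * phi)
    y = (l0 - z) * math.tan(n * phi)
    return x, y, z
```

For the main ray (m = n = 0) the sketch returns X = Y = 0 at any depth, as expected for a ray that coincides with the Z-axis.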




Fig. 3. Three dimensional reconstruction 



By scanning the image, this operation is repeated for all the spots appearing in the 
image. The 3-D points calculated serve as a basis for 3-D object shape reconstruction. 

The problem at hand is then to match each spot in the image with its corresponding 
laser ray. We propose to solve this matching task by using GAs. 



3 Matching Laser Rays and Points Using GAs 

GAs proposed by Holland [8] are adaptive procedures that find solutions to problems 
by using an evolutionary process based on natural selection. Their application in 
image processing and pattern recognition has become widespread in the last decade 
[4][5][9][10][11]. 

A GA is an iterative procedure which uses a population of potential solutions to a 
problem. Each individual solution is encoded as a chromosome made up of a string of 
genes which may take one of several values called alleles. 

Each iteration consists of three main stages, namely evaluation, selection and 
mating. In the evaluation stage, each chromosome is assigned a fitness value which 
represents its ability to solve the problem. 

In the selection stage, chromosomes are picked based on their fitness in such a way 
that the better chromosomes are more likely to be selected. In the mating stage two 
processes are performed: pairs of selected chromosomes, also called parent 
chromosomes, are recombined by the crossover operation to form two new offspring. 

New offspring are also created by modifying one or more genes on the 
chromosomes randomly chosen in the mating pool. This operation is called mutation. 

From generation to generation, this process leads to increasingly better 
chromosomes and to near-optimal solutions. 

3.1 Coding 

In our study, the original data is made up of a set of M points, which are the 
centers of spots in the image reference, and a set of N lines, which represent the 
2-D projections, in the image, of the laser rays cast by the light system. In order to 
constrain the solutions even further, we use a set of segments rather than 
lines. A segment is the part of a line limited by the two points on the two planes located 
at the first and the last positions introduced during the calibration process (see Fig. 2). 
The object to be analyzed must be placed within the region delimited by these two 
planes. It can be noted that the set of segments does not vary regardless of the object 
to be analyzed (the number of segments is always equal to the number of laser rays 
and each segment is always located in the same position). By contrast, the set of 
points depends on the object to be analyzed. In fact, the position of a given point on 
the image depends on the depth of its corresponding real object point and furthermore 
some rays can be missed due to the shape of the object and the position of the camera. 
By scanning the image, each point and each segment are sequentially labeled. 

Given N segments and M points ($N \ge M$), the chromosome of an individual is 
encoded as a permutation of N integers whose values are between one and N. The i-th 
gene is the number of the point to be matched with the i-th segment. Overlooked 
points are replaced by virtual points, which are not taken into account in the fitness 
value. 



3.2 Fitness 

Our goal, therefore, is to match a point with its closest segment. In theory, the point 
must be located exactly along the segment, but in practice, there is an error due to the 
noise effects and the process used to determine the spot center coordinates. Before the 
fitness function is stated formally, the error between a segment and its matched point 
needs to be defined. 

Given a segment S and its matched point P, we consider two cases for calculating 
the error: 

1) if the point is within the segment (see Figure 4a), the error is obtained by: 

$$e = |m x^* + b - y^*|$$

where m is the slope, b is the y-intercept of the segment, and $x^*$ and $y^*$ are the 
coordinates of the corresponding point. 

2) if the point is outside the segment (see Figure 4b), a penalty is applied to the 
error, which is then computed from the distances between the point and the two end 
points of the segment. 

Finally, the fitness function of the k-th chromosome is the sum of the errors of all its 
genes and is expressed by the following equation: 

$$f_k = \sum_{i=1}^{M} e_i \qquad (4)$$

where k is the index of a chromosome, M is the number of points to be matched, and 
$e_i$ is the error of the i-th matched point in that chromosome. 

According to this fitness function, the better the chromosome the lower its fitness 
value. 
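
A minimal sketch of this fitness evaluation follows. Several choices are assumptions made for illustration only: a point counts as "within" a segment when its x-coordinate lies between the segment end points, the penalty adds the distance to the nearer end point (the authors' penalty formula may differ), and genes larger than the number of real points stand for virtual points.

```python
import math

def match_error(point, seg, penalty=1.0):
    """Error between a point and its matched segment.

    seg = ((x1, y1), (x2, y2)) is assumed to be non-vertical.
    """
    (x1, y1), (x2, y2) = seg
    m = (y2 - y1) / (x2 - x1)          # slope of the supporting line
    b = y1 - m * x1                    # y-intercept of the supporting line
    px, py = point
    e = abs(m * px + b - py)           # case 1: |m x* + b - y*|
    if min(x1, x2) <= px <= max(x1, x2):
        return e
    # case 2 (assumed form): penalize by the distance to the nearer end point
    d1 = math.hypot(px - x1, py - y1)
    d2 = math.hypot(px - x2, py - y2)
    return e + penalty * min(d1, d2)

def fitness(chromosome, points, segments):
    """Sum of matching errors; lower is better.

    chromosome[i] is the 1-based number of the point matched with segment i.
    Numbers above len(points) denote virtual points and contribute nothing.
    """
    total = 0.0
    for seg, gene in zip(segments, chromosome):
        if gene <= len(points):
            total += match_error(points[gene - 1], seg)
    return total
```

With two horizontal segments and two points lying close to them, the correct permutation scores far lower than the swapped one, which is the behaviour the selection stage relies on.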





(a) (b) 

Fig. 4. Determination of the error between a segment and its matched point, (a) The point P is 
within its matched segment S. (b) The point P is outside its matched segment S. 



3.3 Selection and Surviving Strategy 

We have applied a modified version of the selection strategy proposed by Z. 
Michalewicz [12]. This process can be described as follows: 

First, we select survivors among the individuals of the current generation for 
the next generation. This selection is based on a criterion called the “Surviving Test” 
and is defined by: 

$$(f_i \times c) < \bar{f} \qquad (5)$$

where $f_i$ is the fitness of the i-th chromosome, $\bar{f}$ is the average fitness and $c \in [0.5, 1.5]$ 
is the survival testing coefficient, obtained randomly. 

Thus, the number of survivors of each generation varies. Moreover, the above 
surviving test provides more variations of chromosomes in a population since a 
chromosome with a high fitness value and a chromosome with an average fitness 
value have almost the same chance to survive. 

Secondly, by a tournament selection [13], $(N_p - N_g)$ parents are picked from the 
whole population of the current generation to conduct the crossover operation 
described below. The number of offspring produced depends on the crossover rate 
$p_c$ and is equal to $(N_p - N_g) \times p_c$. 
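
The two selection steps can be sketched as follows. The sketch assumes that the coefficient c is drawn uniformly from [0.5, 1.5] for each individual and that tournaments have size k; neither detail is stated in the text, and the function names are hypothetical.

```python
import random

def survivors(population, fitnesses, rng):
    """Keep the individuals that pass the surviving test f_i * c < mean fitness."""
    mean_f = sum(fitnesses) / len(fitnesses)
    kept = []
    for individual, f in zip(population, fitnesses):
        c = rng.uniform(0.5, 1.5)      # survival testing coefficient (assumed uniform)
        if f * c < mean_f:
            kept.append(individual)
    return kept

def tournament(population, fitnesses, rng, k=2):
    """Return the best (lowest fitness) of k distinct, randomly drawn individuals."""
    picks = rng.sample(range(len(population)), k)
    best = min(picks, key=lambda i: fitnesses[i])
    return population[best]
```

Because lower fitness is better here, an individual whose fitness is well below the average always survives, while one far above the average never does; individuals near the average survive or not depending on the draw of c.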



3.4 Crossover 

In this study, we have applied the PMX operation [13]. Two parents are selected in 
the mating pool. First, two cut points are randomly chosen. Then the portions of the 
two parents between the cut points are swapped. Finally, the rest of each chromosome is 
copied in such a way that no gene is repeated (see Fig. 5). 






Fig. 5. PMX operator 
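
A compact sketch of the PMX operation is given below. It follows the textbook formulation of partially mapped crossover; the cut points are passed in explicitly so that the example is reproducible, and the function name is hypothetical.

```python
def pmx(parent1, parent2, cut1, cut2):
    """Build one PMX child.

    The block [cut1, cut2) is taken from parent2; the remaining positions are
    copied from parent1, and any gene that would be repeated is replaced by
    following the mapping defined by the swapped block.
    """
    size = len(parent1)
    child = [None] * size
    child[cut1:cut2] = parent2[cut1:cut2]
    # gene placed from parent2 -> gene it displaced in parent1
    mapping = {parent2[i]: parent1[i] for i in range(cut1, cut2)}
    for i in list(range(cut1)) + list(range(cut2, size)):
        gene = parent1[i]
        while gene in mapping:         # already used inside the block
            gene = mapping[gene]
        child[i] = gene
    return child
```

Calling the function twice with the parents exchanged yields the two offspring; both remain valid permutations, which is why PMX suits the permutation coding used here.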



3.5 Mutation 

Mutation is used to prevent a convergence to a local optimum. In fact, crossover 
loses its role when the greater part of the population is centered around a local 
optimum of the fitness function. In this case, individuals at the same peak are often 
identical. So, the crossover operation does not significantly modify a chromosome. 
The mutation operation is a local mutation that shuffles the loci of genes in a 
randomly chosen block within the chromosome. 
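
The local mutation can be sketched as below. The block length is a free parameter here (an assumption; the text does not say how the block size is chosen), and the function returns a new chromosome rather than modifying its argument.

```python
import random

def block_shuffle_mutation(chromosome, rng, block_len=3):
    """Shuffle the genes inside one randomly placed block of the chromosome."""
    size = len(chromosome)
    block_len = min(block_len, size)
    start = rng.randrange(size - block_len + 1)
    block = chromosome[start:start + block_len]
    rng.shuffle(block)                 # permute genes inside the block only
    return chromosome[:start] + block + chromosome[start + block_len:]
```

Since genes are only reordered, the result is still a permutation, and at most block_len positions differ from the parent.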



4 Partitioning Method 

When using GAs, the size of the chromosome is a drawback for computational time. The 
curve in Fig. 12 shows that the execution time of the crossover operation grows 
exponentially with the chromosome size. In [14] we have introduced a deterministic 
step to cope with this problem. However, since we cannot backtrack after matching 
some segments and points, an error occurring in this step may not be corrected in the 
GAs step. 


(a) (b) 

Fig. 6. (a) Superimposed view of segments and points, (b) the member function. 



In a real acquisition process, according to a chosen configuration of the camera and 
the laser ray system, the segments and the points form a vertical (or horizontal) stripe 
pattern, and ambiguities in the matching process only occur in the vertical (or 
horizontal) direction (see Fig. 6 a). So, the data can be partitioned into independent 
regions. There are always 19 clusters since the laser array projects 19x19 beams on 
the scene which gives 19 independent matching problems to solve. 

This partitioning process provides two main advantages. First, the size of a 
chromosome is always the same (a chromosome is composed of 19 genes) whatever 
the complexity of the object to be analyzed. Second, the search space is limited since 
the matching process is performed locally in each cluster, which speeds up 
convergence. 

4.1 Clustering the Data 

The simplest way of getting regions is by using lines, referred to as the dividers, that 
pass through the center of each cluster of segments. Since these dividers are not 
parallel, we have defined a customized member function for each region for the 
purpose of clustering the points. 

The customized member function $R_i(x)$ of the i-th region is expressed as (see Fig. 
6 b): 

$$R_i(x) = \begin{cases} 1, & x_{i-1} < x < x_i \\ 0, & \text{otherwise} \end{cases} \qquad \text{with } 1 \le i \le 19 \qquad (6)$$

where $x_{i-1}$ and $x_i$ are the x-values of the respective corresponding points on the dividers. 
Note that $x_0 = 0$; $x_{19}$ is likewise a fixed bound. 

For example, let’s consider a point P and its coordinates (p, q). The equation of the 
i-th divider is expressed as $y_i = m_i x + b_i$. By replacing $y_i$ with q in the above formula, 
we obtain the corresponding $x_i$. Then, $x_{i-1}$ is calculated in the same way by considering 
the equation of the (i - 1)-th divider. According to the value of p, which corresponds to 
x in the member function (eq. 6), P may or may not be included in the i-th region. 
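
The clustering of a point can be sketched as follows. The sketch treats the dividers as the inner region boundaries, uses half-open intervals so that a point lying exactly on a divider belongs to one region only, and takes the outer bounds x_min and x_max as parameters; all three choices are assumptions made for illustration.

```python
def divider_x(m, b, q):
    """x-value at which the divider y = m*x + b reaches height q (m non-zero)."""
    return (q - b) / m

def region_of(point, dividers, x_min, x_max):
    """Return the 1-based index of the region containing point, or None.

    dividers is a list of (m, b) pairs ordered from left to right.
    """
    p, q = point
    bounds = [x_min] + [divider_x(m, b, q) for m, b in dividers] + [x_max]
    for i in range(1, len(bounds)):
        if bounds[i - 1] < p <= bounds[i]:   # member function R_i(p) = 1
            return i
    return None
```

Because the dividers are not parallel, the bounds are recomputed at the height q of each point, which is the role of the customized member function.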

4.2 Chromosome Coding and Multi-Population 

The points have been distributed into 19 independent regions, each of which has 
19 points and 19 segments. Thus a chromosome consists of a permutation of 19 
integers that represent the indices of points. Note that if there are some overlooked 
points, we introduce virtual points to complete the chromosome. So we have 19 
problems that can be solved in parallel. We work with a population PP partitioned 
into sub-populations $PP_i$ ($i \in [1, 19]$), all of the same size (pop_size). Thus we have the 
following organization: $PP = \{PP_1, PP_2, PP_3, \ldots, PP_{18}, PP_{19}\}$, $PP_i = \{C_{i1}, C_{i2}, \ldots, C_{i\,pop\_size}\}$, 
and $C_{ij} = \{S_1, S_2, \ldots, S_{18}, S_{19}\}$, where $C_{ij}$ is the j-th chromosome of the i-th sub-population, 
$S_j$ are the segments of the i-th region, $i \in [1, 19]$ and $j \in [1, 19]$. 

The genetic operations described above are then applied to each sub-population 
independently (see Fig. 7). 

488 Sanghyuk Woo, Albert Dipanda, and Franck Marzani 

Fig. 7. Outlook of the partitioning method 

5 Experimental Results 

In this section, we present experimental results of our method. A large number of 
experiments were conducted to show the validity of our approach for 3-D shape 
reconstruction. 

Fig. 9, Fig. 10, and Fig. 11 illustrate 3-D reconstruction of three objects: a biplane, 
a computer mouse and a human face, which constitute a meaningful sample of the 
objects we used. We have placed a plane behind the biplane and the computer mouse 
during their image acquisition. The objective was to recover all the laser spots. 
The face images, by contrast, were acquired in a more realistic context, i.e. without 
any additional plane. Thus, only the laser rays which intersect the face may appear on 
the images. Moreover, we have acquired an image sequence of the face during a 
rotation movement. 

There were 3 overlooked points on the biplane image and 11 overlooked points on 
the computer mouse image. On the face images, only about 200 spots, out of the 361 
possible, were visible. 

The experiments were always conducted until an optimal solution was reached. 
Since GAs are stochastic, the number of generations may vary at different runs. The 
mean generation number was 200 whatever the object to reconstruct since all the 
matching problems were similar. 



Application of Genetic Algorithms to 3-D Shape Reconstruction 489 




Fig. 8. Images obtained with the active stereo vision system. The aperture of the camera is chosen 
so that only the spots are visible. The biplane (left) and the computer mouse (right). 




Fig. 9. 3-D Reconstruction of a biplane - A mesh view (left) and a shaded surface view (right) 
from a different angle 



The major concern in our approach is the selection of GAs parameters to obtain the 
final results with high speed convergence. An experimental study shows that the 
population size is the most important parameter. On one hand, if the population size is 
too small, it may cause a premature convergence to a suboptimal solution. On the 
other hand, a large population size requires more evaluations per generation but 
allows better convergence to an optimal solution. Fig. 13 shows the curve of 
execution time versus the population size. It appears that the optimal population size 
is 30. According to our experimental study, the importance of the other parameters is 
not very significant. The crossover rate was set to 0.8 and the mutation rate was set to 
0.1. 

For each object, more than 100 different runs for 3-D reconstruction were carried 
out giving the same final result, which proves that our method is stable. Fig. 14 
represents an example of the evolution of the fitness of the best chromosome in each 
of the 19 sub-populations. Fig. 15 illustrates the fitness value evolution, until 
convergence, of all individuals in a sub-population at each generation. It appears that 
the process converges rapidly. 

The different results obtained show that our approach permits the recovery of 3-D 
coordinates of spots and consequently it enables 3-D object reconstruction with high 
accuracy. 

Fig. 10. 3-D Reconstruction of a computer mouse - A mesh view (left) and a shaded surface 
view (right) from a different angle. 







Fig. 11. 3-D Reconstruction of a face which is rotating 



Fig. 12. Execution time for 500 crossover operations vs. chromosome size (PMX) 




Fig. 13. Plot of average execution time versus population size for computer mouse. 




Fig. 14. Plot of fitness evolution of the best chromosome in each subpopulation 

Fig. 15. Plot of the fitness evolution of all individuals in a subpopulation 



6 Conclusion 

In this paper, a new method for reconstructing 3-D shapes is proposed. It is based on 
an active stereo vision system composed of a camera and a light system which 
projects a set of structured laser rays on the scene to be analyzed. The image acquired 
by the system has many spots created by the laser rays. These spots are the basis of 
the 3-D shape reconstruction. Determining the depth of a point, 
represented by a spot in the image, requires identifying the laser ray from which the 
spot originates. The matching task is performed using GAs. Data is partitioned into 
independent clusters and GAs are run independently in the different clusters. 
In all cases, the matching processes converge towards the optimum solution, which 
shows that GAs can effectively be used for this matching problem. The 3-D 
reconstruction is performed by an efficient method. The experimental results confirm 
that the proposed approach is stable and provides high accuracy 3-D object 
reconstruction. Consequently, it can be used efficiently as the first stage in a global 
3-D shape analysis system. 



Abstract. A Markov process model for contour curvature is introduced 
via a stochastic differential equation. We analyze the distribution of such 
curves, and show that its mode is the Euler spiral, a curve minimizing 
changes in curvature. To probabilistically enhance noisy and low con- 
trast curve images (e.g., edge and line operator responses), we combine 
this curvature process with the curve indicator random field, which is 
a prior for ideal curve images. In particular, we provide an expression 
for a nonlinear, minimum mean square error filter that requires the so- 
lution of two elliptic partial differential equations. Initial computations 
are reported, highlighting how the filter is curvature-selective, even when 
curvature is absent in the input. 



1 Introduction 

Images are ambiguous. One unpleasant consequence of this singular fact is that 
we cannot compute contours without making assumptions about image struc- 
ture. Inspired by Gestalt psychology [16], most previous work has defined this 
structure as good continuity in orientation, that is to say, curves with vary- 
ing orientation — high curvature — are rejected, and, conversely, straighter curves 
are enhanced. This is naturally phrased in terms of an energy functional on 
curves that minimizes curvature. In this paper, we present a stochastic model 
that instead aims to enforce good continuation in curvature, and thus minimizes 
changes in curvature. 

To understand why we believe that good continuation in curvature is im- 
portant, imagine the situation of a bug trying to “track” the contour in Fig. 1. 
Suppose the bug is special in that it can only “search” for its next piece of con- 
tour in a cone in front of it centered around its current predicted position and 
direction (i.e., orientation with polarity) [24,6]. This strategy is appropriate so 
long as the contour is relatively straight. However, when the bug is on a portion 
of the contour veering to the right, it will constantly waste time searching to 
the left, perhaps even mistracking completely if the curvature is too large. In 
estimation terms, the errors of our searching bug are correlated, a tell-tale clue 
that the assumption that the contour is straight is biased. A good model would 
only lead to an unavoidable “uncorrelated” error. We present a Markov process 
that models not only the contour’s direction, but also its local curvature. 

M.A.T. Figueiredo, J. Zerubia, A.K. Jain (Eds.): EMMCVPR 2001, LNCS 2134, pp. 497-512, 2001. 
(c) Springer-Verlag Berlin Heidelberg 2001 

Jonas August and Steven W. Zucker 

Fig. 1. Mistracking without curvature. A bug (grey dot) attempts to track the contour, 
“looking” in the cone of search directions centered around its current direction. At point 
(a), the curve is straight and the bug is successful, although at (b), the curve is veering 
to the right and the bug can barely still track. At (c), the curvature is so high that 
tracking fails. A better model would explicitly include the curvature of the contour, 
giving rise to a “bent” search cone (d) for the bug. The same difficulty arises in contour 
enhancement, which is the application considered in this paper. 

It may appear that one may avoid these problems altogether by allowing a 
higher bound on curvature. However, this forces the bug to spend more time 
searching in a larger cone. In stochastic terms, this larger cone amounts to 
asserting that the current (position, direction) state has a weaker influence on 
the next state; in other words, the prior on contour shape is weaker (less peaked 
or broader). But a weaker prior will be less able to counteract a weak likelihood 
(high noise): it will not be robust to noise. Thus we must accept that good 
continuation models based only on contour direction are forced to choose between 
allowing high curvature or high noise; they cannot have it both ways^. 

Although studying curvature is hardly new in vision, modeling it probabilis- 
tically is. In [3,25,15] and [8, 361 ff.], measuring curvature in images was the 
key problem. In [14], curvature is used for smooth interpolations, following on 
the work on elastica in [20,10] and later [18]. The closest work in spirit to this 
is relaxation labeling [26], several applications of which include a deterministic 
form of curvature [19,11]. Markov random fields for contour enhancement using 
orientation [17] and co-circularity [9,21] have been suggested, but these have no 
complete stochastic model of individual curves. The explicit study of stochastic 
but direction-only models of visual contours was initiated by Mumford [18] and 
has been an extended effort of Williams and co-workers [22,23]. 

This paper is organized as follows. We first introduce a curvature random 
process and its diffusion equation; we then present example impulse responses, 
which act like the “bent” search cones. Second, we relate the mode of the 
distribution for the curvature process to an energy functional on smooth curves. 
Next we review a model of an ideal curve image (e.g., “perfect” edge operator 
responses) called the curve indicator random field (CIRF), which was introduced 
in [2] and theoretically developed in [1], but in the context of Mumford’s 
direction process [18]. Here we apply the CIRF to the curvature process, and then 
report a minimum mean square error filter for enhancing image contours. We 
conclude with some example computations. 

^ Observe in the road-tracking examples in [6] how all the roads have fairly low 
curvature. While this is realistic in flat regions such as the area of France considered, 
others, more mountainous perhaps, have roads that wind in the hillsides. 



2 A Brownian Motion in Curvature 

Recall that a planar curve is a function taking a parameter $t \in \mathbb{R}$ to a point 
$(x(t), y(t))$ in the plane $\mathbb{R}^2$. Its direction is defined as $\theta = \arg(\dot{x} + \sqrt{-1}\,\dot{y})$, 
where the dot denotes differentiation with respect to the arc-length parameter 
$t$ ($\dot{x}^2 + \dot{y}^2 = 1$ is assumed). Curvature $\kappa$ is equal to $\dot{\theta}$, the rate of change of 
direction. 

Now we introduce a Markov process that results from making curvature a 
Brownian motion. Let $R(t) = (X, Y, \Theta, K)(t)$ be random^, with realization $r = 
(x, y, \theta, \kappa) \in \mathbb{R}^2 \times \mathbb{S} \times \mathbb{R}$. Consider the following stochastic differential equation: 

$$\dot{X} = \cos\Theta, \quad \dot{Y} = \sin\Theta, \quad \dot{\Theta} = K, \quad dK = \sigma\, dW,$$

where $\sigma = \sigma_{\dot{\kappa}}$ is the “standard deviation in curvature change” (see §3) and 
W denotes standard Brownian motion. The corresponding Fokker-Planck partial 
differential equation (PDE), describing the diffusion of a particle’s probability 
density, is 



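$$\frac{\partial p}{\partial t} = \frac{\sigma^2}{2}\frac{\partial^2 p}{\partial \kappa^2} - \cos\theta\,\frac{\partial p}{\partial x} - \sin\theta\,\frac{\partial p}{\partial y} - \kappa\,\frac{\partial p}{\partial \theta} = \frac{\sigma^2}{2}\frac{\partial^2 p}{\partial \kappa^2} - (\cos\theta, \sin\theta, \kappa, 0) \cdot \nabla p, \qquad (1)$$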



where $p = p(x, y, \theta, \kappa, t) = p(R(t) = r \mid R(0) = r_0)$, the conditional probability 
density that the particle is located at $r$ at time $t$ given that it started at $r_0$ at time 
0. Observe that this PDE describes probability transport in the $(\cos\theta, \sin\theta, \kappa, 0)$-direction 
at point $r = (x, y, \theta, \kappa)$, and diffusion in $\kappa$. An extra decay term [18,22] 
is also included to penalize length (see §3). We have solved this parabolic 
equation by first analytically integrating the time variable and then discretely 
computing the solution to the remaining elliptic PDE. Details will be reported in [1]. 
See Fig. 2 for example time-integrated transition densities. 
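
To build intuition for the process itself (as opposed to its transition density), a sample path can be drawn with a simple Euler-Maruyama scheme. The sketch below is only an illustration under stated assumptions: it uses a fixed number of steps instead of an exponentially distributed length, it is not the PDE solver mentioned above, and the function name is hypothetical.

```python
import math
import random

def simulate_curvature_process(n_steps, dt, sigma, rng,
                               state=(0.0, 0.0, 0.0, 0.0)):
    """Euler-Maruyama sample path of (x, y, theta, kappa).

    Position moves at unit speed along theta, theta changes at rate kappa,
    and kappa performs a Brownian motion with standard deviation sigma.
    """
    x, y, theta, kappa = state
    path = [state]
    for _ in range(n_steps):
        x += dt * math.cos(theta)
        y += dt * math.sin(theta)
        theta += dt * kappa
        kappa += sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        path.append((x, y, theta, kappa))
    return path
```

With sigma = 0 the curvature stays at its initial value and the path is a polygonal approximation of a circular arc; with small positive sigma the path veers smoothly, in the spirit of the "bent" search cones.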



^ Capitals will often be used to denote random variables, with the corresponding 
letter in lower case denoting a realization. However, capitals are also used to denote 
operators later in the paper. 








Fig. 2. Curvature diffusions for various initial curvatures. For all cases the initial 
position of the particle is an impulse centered vertically on the left, directed horizontally 
to the right. Shown is the time integral of the transition density of the Markov process 
in curvature (1), integrated over direction $\theta$ and curvature $\kappa$; therefore, the brightness 
displayed at position (x, y) indicates the expected time that the particle spent in (x, y). 
(Only a linear scaling is performed for displays in this paper; no logarithmic or other 
nonlinear transformation in intensity is taken.) Observe that the solution “veers” 
according to curvature, as sought in the Introduction. Contrast this with the straight 
search cone in Fig. 3. The PDE was solved on a discrete grid of size 32 x 32 x 32 x 5, 
with $\sigma_{\dot{\kappa}} = 0.01$ and an exponential decay of characteristic length $\lambda = 10$ (see §3 for 
length distribution). 



3 What Is the Mode of the Distribution 
of the Curvature Random Process? 



To get more insight into our random process in curvature, consider one of the 
simplest aspects of its probability distribution: its mode. First, let us consider 
the situation for Mumford’s direction-based random process in $\mathbb{R}^2 \times \mathbb{S}$^, or 

$$\dot{X} = \cos\Theta, \quad \dot{Y} = \sin\Theta, \quad d\Theta = \sigma_\kappa\, dW,$$

where $\sigma_\kappa$ is the “standard deviation in curvature” and W is standard Brownian 
motion. This process has the following Fokker-Planck diffusion equation: 



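$$\frac{\partial p}{\partial t} = \frac{\sigma_\kappa^2}{2}\frac{\partial^2 p}{\partial \theta^2} - \cos\theta\,\frac{\partial p}{\partial x} - \sin\theta\,\frac{\partial p}{\partial y}, \qquad (2)$$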



where $p = p(x, y, \theta, t)$ is the transition density for time $t$. As Mumford has 
shown [18], the mode of the distribution of this direction process is described by 
elastica, or planar curves that minimize the following functional: 

$$\int (\alpha\kappa^2 + \beta)\, dt, \qquad (3)$$

where $\alpha$ and $\beta$ are nonnegative constants. With such an elegant expression for 
the mode of the Mumford process, we ask: Is there a corresponding functional 
for the curvature process? If so, what is its form? 

^ $\mathbb{R}^2 \times \mathbb{S}$ is also called $(x, y, \theta)$-space, the unit tangent bundle, and orientation 
space [13]. 

Fig. 3. Solutions to Mumford’s diffusion. Equation (2) was integrated over time and 
then solved for a slightly blurred impulse on a 80 x 80 x 44 grid, with parameters 
$\sigma_\kappa = 1/24$, $\lambda = 100$, and at discrete directions 0 (left, 0°) and 4 (right, 33°). Depicted is 
the integral over $\theta$, cropped slightly. The method used [1] responds accurately at all 
directions. Note that these responses are straight, analogous to the search cone described 
in the Introduction. Given their initial direction, particles governed by the direction 
process move roughly straight ahead, in contrast to those described by our curvature 
process (Fig. 2). 



To answer these questions, we follow a line of analysis directly analogous to 
Mumford [18]. First, we discretize our random curve into N subsections. Then 
we write out the distribution and observe the discretization of a certain integral 
that will form our desired energy functional. 

Suppose our random curve from the curvature process has length $T$, 
distributed with the exponential density $p(T) = \lambda^{-1}\exp(-\lambda^{-1}T)$, and independent 
of the shape of the contour. Each step of the N-link approximation to the curve 
has length $\Delta t := T/N$. Using the definition of the t-derivatives, for example, 

$$\dot{X} = \frac{dX}{dt} = \lim_{N \to \infty} \frac{X_{i+1} - X_i}{T/N},$$

we can make the approximation $X_{i+1} \approx X_i + \Delta t\,\dot{X}$. Recalling the stochastic 
differential equation (1), we therefore let the curvature process be approximated 
in discrete time by 



$$X_{i+1} = X_i + \Delta t \cos\Theta_i, \quad Y_{i+1} = Y_i + \Delta t \sin\Theta_i, \quad \Theta_{i+1} = \Theta_i + \Delta t\, K_i,$$

where $i = 1, \ldots, N$. Because Brownian motion has independent increments 
whose standard deviation grows with the square root $\sqrt{\Delta t}$ of the time 
increment $\Delta t$, the change in curvature for the discrete process becomes 

$$K_{i+1} = K_i + \sqrt{\Delta t}\,\epsilon_i,$$

where $\{\epsilon_i\}$ is an independent and identically distributed set of 0-mean, Gaussian 
random variables of standard deviation $\sigma = \sigma_{\dot{\kappa}}$. Let the discrete contour be 
denoted by 

$$\Gamma_N = \{(X_i, Y_i, \Theta_i, K_i) : i = 0, \ldots, N\}.$$

Given an initial point $p_0 = (x_0, y_0, \theta_0, \kappa_0)$, the probability density for the other 
points is 



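$$p(\Gamma_N \mid p_0) = \lambda^{-1}\exp(-\lambda^{-1}T) \cdot (\sqrt{2\pi}\,\sigma)^{-N} \exp\left(-\sum_i \frac{\epsilon_i^2}{2\sigma^2}\right),$$

which, by substitution, is proportional to 

$$\exp\left[-\sum_i \frac{1}{2\sigma^2}\left(\frac{\kappa_{i+1} - \kappa_i}{\Delta t}\right)^2 \Delta t - \lambda^{-1}T\right].$$

We immediately recognize $(\kappa_{i+1} - \kappa_i)/\Delta t$ as an approximation to $d\kappa/dt = \dot{\kappa}$, and so we 
conclude that 

$$p(\Gamma_N \mid p_0) \to p(\Gamma \mid p_0) \propto e^{-E(\Gamma)} \quad \text{as } N \to \infty,$$

where the energy $E(\Gamma)$ of (continuous) curve $\Gamma$ is 

$$E(\Gamma) = \int (\alpha\dot{\kappa}^2 + \beta)\, dt, \qquad (4)$$

where $\alpha = (2\sigma^2)^{-1}$ and $\beta = \lambda^{-1}$. 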

Maximizers of the distribution $p(\Gamma)$ for the curvature random process are 
planar curves that minimize the energy functional $E(\Gamma)$. Such curves are 
known as Euler spirals, and have been studied recently in [14]. A key aspect of 
the Euler spiral functional (4) is that it penalizes changes in curvature, preferring 
curves with slowly varying curvature. In contrast, the elastica functional (3) 
penalizes curvature itself, and therefore allows only relatively straight curves, to 
the dismay of the imaginary bug of the Introduction. 
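
The contrast between the two functionals can be checked numerically on sampled curvature profiles. The sketch below uses simple left-endpoint sums and forward differences; the function names and the sampling convention (N + 1 curvature samples spaced dt apart) are assumptions made for illustration.

```python
def elastica_energy(kappas, dt, alpha, beta):
    """Discrete version of functional (3): sum of (alpha*kappa^2 + beta)*dt."""
    return sum((alpha * k * k + beta) * dt for k in kappas[:-1])

def euler_spiral_energy(kappas, dt, alpha, beta):
    """Discrete version of functional (4): penalizes changes in curvature."""
    total = beta * dt * (len(kappas) - 1)
    for k0, k1 in zip(kappas, kappas[1:]):
        kappa_dot = (k1 - k0) / dt         # finite-difference d(kappa)/dt
        total += alpha * kappa_dot ** 2 * dt
    return total
```

For a circular arc (constant non-zero curvature) and beta = 0, the elastica energy is positive while the Euler spiral energy vanishes, which is precisely why the curvature process does not disfavor steadily turning contours.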



4 Filtering and the Curve Indicator Random Field 

Given our stochastic shape model for contours, we now introduce the curve 
indicator random field (CIRF), which naturally captures the notion of an ideal 
curve image, and provides a precise definition for the kind of output we would 
like from an edge operator, for example. Roughly, this random field is 
non-zero-valued along the true contours, and zero-valued elsewhere. The actually 
measured edge/line map is then viewed as an imperfect CIRF, corrupted by 
noise, blur, etc. The goal of filtering, then, is to estimate the true CIRF given 
the imperfect one. For completeness, we review the theory of the CIRF now; 
proofs and more details can be found in [1]. 



4.1 Definitions 

For generality, we shall define the curve indicator random field for any 
continuous-time Markov process $R_t$, $0 \le t < T$, taking values in a finite (or at 
most countable) set $\mathcal{I}$ of cardinality $|\mathcal{I}|$. As in §3, the random variable $T$ is 
exponentially-distributed with mean value $\lambda > 0$, and represents the length of 
a contour. To ensure the finiteness of the expressions that follow, we assume 
$\lambda < \infty$. Sites or states within $\mathcal{I}$ will be denoted $i$ and $j$. (Think of $\mathcal{I}$ as a 
discrete approximation to the state space $\mathcal{R} = \mathbb{R}^2 \times \mathbb{S} \times \mathbb{R}$ where the curvature 
random process takes values.) Let $\mathbb{1}\{\text{condition}\}$ denote the (indicator) function 
that takes on value 1 if condition is true, and the value 0 otherwise. With these 
notations we define the curve indicator random field $V$ for a single curve to be 

$$V_i := \int_0^T \mathbb{1}\{R_t = i\}\, dt, \quad \forall i \in \mathcal{I}.$$

Observe that $V_i$ is the (random) amount of time that the Markov process spent 
in state $i$. In particular, $V_i$ is zero unless the Markov process passed through 
site $i$. In the context of Brownian motion or other symmetric processes, $V$ is 
variously known as the occupation measure or the local time of $R_t$ [4,5]. 

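Generalizing to multiple curves, we pick a random number $\mathcal{N}$, Poisson 
distributed with average value $\bar{\mathcal{N}}$. We then choose $\mathcal{N}$ independent copies $R^{(1)}_{t_1}, \ldots, 
R^{(\mathcal{N})}_{t_{\mathcal{N}}}$ of the Markov process $R_t$, with independent lengths $T_1, \ldots, T_{\mathcal{N}}$, each 
distributed as $T$. To define the (multiple curve) CIRF, we take the superposition 
of the single-curve CIRFs $V^{(1)}, \ldots, V^{(\mathcal{N})}$ for the $\mathcal{N}$ curves. 

Definition 1. The curve indicator random field $U$ is defined to be 

$$U_i := \sum_{n=1}^{\mathcal{N}} V_i^{(n)} = \sum_{n=1}^{\mathcal{N}} \int_0^{T_n} \mathbb{1}\{R^{(n)}_{t_n} = i\}\, dt_n, \quad \forall i \in \mathcal{I}.$$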

Thus Ui is the total amount of time that all of the Markov processes spent in 
site i. Again, observe that this definition satisfies our desiderata for an ideal 
edge/line map: (1) non-zero value where the contours are, and (2) zero-value 
elsewhere. The probability distribution of U will become our prior for inference. 
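
On sampled paths the definition reduces to counting occupation time per site. The sketch below assumes each curve is given as a list of site indices sampled every dt time units; this discrete-time sampling and the function names are assumptions, not part of the definition.

```python
def single_curve_cirf(path, dt, n_sites):
    """Occupation time V_i: time one sampled path spent at each site i."""
    v = [0.0] * n_sites
    for site in path:
        v[site] += dt
    return v

def cirf(paths, dt, n_sites):
    """U_i: superposition of the single-curve fields of all sampled paths."""
    u = [0.0] * n_sites
    for path in paths:
        for i, vi in enumerate(single_curve_cirf(path, dt, n_sites)):
            u[i] += vi
    return u
```

Sites never visited by any path keep the value zero, and the entries of U sum to the total length of all curves.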



4.2 Statistics of the Curve Indicator Random Field 



Probabilistic models in vision and pattern recognition have been specified in a 
number of ways. For example, Markov random field models [7] are specified via 
clique potentials and Gaussian models are specified via means and covariances. 
Here, instead of providing the distribution of the curve indicator random field 
itself, we report its moment generating functional, from which all moments can 
be computed straightforwardly. 

Before doing so, we need to develop more Markov process theory. We first
define the inner product $\langle a, b \rangle = \sum_{i \in \mathcal{I}} a_i b_i$.
The generator of the Markov process $R_t$ is the $|\mathcal{I}| \times |\mathcal{I}|$
matrix $L = (l_{ij})$, and is the instantaneous rate of change of the
probability transition matrix $P(t) = (p_{ij})(t)$ for $R_t$. For the curvature process,
we let L be a discretization of the partial differential operator on the right hand
side of (1), or



$$L \approx \ldots \qquad \text{(curvature process)},$$
and for the direction process, L is the discretization of the corresponding operator
in (2), or

$$L \approx \frac{\sigma_\kappa^2}{2}\frac{\partial^2}{\partial\theta^2} - \cos\theta\,\frac{\partial}{\partial x} - \sin\theta\,\frac{\partial}{\partial y} \qquad \text{(direction process)}.$$



To include the exponential distribution over T (the lifetime of each particle), we
construct a killed Markov process with generator $Q = L - \alpha I$, where
$\alpha := \lambda^{-1}$ and $\lambda$ is the mean lifetime. (Formally, we do
this by adding a single "death" state $\dagger$ to the discrete state space
$\mathcal{I}$. When $t > T$, the process enters $\dagger$ and it cannot leave.)
Slightly changing our notation, we shall now use $R_t$ to mean the killed Markov
process with generator Q. The Green's function matrix $G = (g_{ij})$ of the Markov
process is the matrix $\int_0^\infty e^{Qt}\,dt = \int_0^\infty P(t)e^{-\alpha t}\,dt$,
where $P(t) = e^{tL}$ ($e^{A}$ denotes the matrix exponential of matrix A). The
(i, j)-entry $g_{ij}$ in the Green's function matrix G represents the expected
amount of time that the Markov process spent in j before death, given that the
process started in i. One can show that $G = -Q^{-1}$. Given a vector
$c \in \mathbb{R}^{|\mathcal{I}|}$ having sufficiently small entries $c_i$, we define
the Green's function matrix G(c) with spatially-varying "creation" c as the Green's
function matrix for the killed Markov process with extra killing $-c$, i.e., having
generator $Q(c) := Q + \mathrm{diag}\,c$, where diag c is a diagonal matrix with the
entries of vector c along the diagonal; in particular,
$G(c) = -Q(c)^{-1} = -(Q + \mathrm{diag}\,c)^{-1}$.
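
The relation between the generator and the Green's function matrix can be checked
numerically on a toy example. The sketch below assumes a hypothetical ring-shaped
generator and a helper named `green_matrix`; neither comes from the paper. Since
the rows of a generator sum to zero, each row of G should sum to the mean lifetime.

```python
import numpy as np

def ring_generator(n, rate=1.0):
    """Hypothetical generator L: a symmetric random walk on a ring of n >= 3 sites."""
    eye = np.eye(n)
    return rate * (np.roll(eye, 1, axis=1) + np.roll(eye, -1, axis=1) - 2.0 * eye)

def green_matrix(L, mean_lifetime, c=None):
    """Return G(c) = -(Q + diag c)^{-1} with Q = L - I / mean_lifetime."""
    n = L.shape[0]
    Q = L - np.eye(n) / mean_lifetime
    if c is not None:
        Q = Q + np.diag(c)          # "creation" term; entries must stay small
    return -np.linalg.inv(Q)

L = ring_generator(6)
G = green_matrix(L, mean_lifetime=10.0)
G_c = green_matrix(L, mean_lifetime=10.0, c=np.full(6, 0.01))
row_sums = G @ np.ones(6)           # expected total time before death
```

With a constant creation term of 0.01 the effective killing rate drops from 0.1 to
0.09, so the rows of `G_c` sum to 1/0.09 instead of 10.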

Recalling that each Markov process $R_t^{(1)}, \ldots, R_t^{(\mathcal{N})}$ is
distributed as $R_t$, let the joint distribution of the initial and final states of
$R_t$ be

$$\mathbb{P}\{R_0 = i, R_{T-} = j\} = \mu_i\, g_{ij}\, \nu_j, \qquad \forall i, j \in \mathcal{I},$$

where $\mu = (\mu_i)$ and $\nu = (\nu_i)$ are vectors in $\mathbb{R}^{|\mathcal{I}|}$
weighting initial and final states, respectively, and $f(T-)$ is the limit of $f(t)$
for t approaching T from below. Therefore, vectors $\mu$ and $\nu$ must satisfy the
normalization constraint $\langle \mu, G\nu \rangle = 1$. For general purpose contour
enhancement, we typically have no a-priori preference for the start and end locations
of each contour, and so we set these vectors proportional to the constant vector
$\mathbf{1} = (1, \ldots, 1)$. For example, by setting
$\mu = |\mathcal{I}|^{-1}\mathbf{1}$, $\nu = \lambda^{-1}\mathbf{1}$, the normalization
constraint is satisfied.

The following key theoretical result is used in this paper but is developed 
in [1] , and is most closely related to the work of Dynkin [4] . 

Proposition 1. The moment generating functional of the curve indicator random
field U is

$$\mathbb{E}\exp\langle c, U \rangle = \exp\langle \mu, \bar{\mathcal{N}}(G(c) - G)\nu \rangle.$$

While this result may seem abstract, it is actually very useful. Let G* denote the 
transpose of G. From Prop. 1 we obtain the first two cumulants of the CIRF [2]: 

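Corollary 1. Suppose that $\mu = |\mathcal{I}|^{-1}\mathbf{1}$ and
$\nu = \lambda^{-1}\mathbf{1}$. The mean of the curve indicator random field U is
$\mathbb{E}U_i = \bar{\mathcal{N}}\lambda|\mathcal{I}|^{-1}$, $\forall i \in \mathcal{I}$.
The covariance matrix of U is
$\mathrm{cov}\,U = \bar{\mathcal{N}}\lambda|\mathcal{I}|^{-1}(G + G^*)$.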
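
A numerical sanity check of Corollary 1 is sketched below, assuming a small
symmetric ring generator (a hypothetical stand-in for the discretized operators
used in the paper) and the helper names `log_mgf` and `cumulants_by_differences`.
The first two cumulants are obtained by central finite differences of the
logarithm of the moment generating functional of Prop. 1 at c = 0.

```python
import numpy as np

def log_mgf(c, Q, mu, nu, mean_curves):
    """log E exp<c, U> = mean_curves * <mu, (G(c) - G) nu>, following Prop. 1."""
    G = -np.linalg.inv(Q)
    G_c = -np.linalg.inv(Q + np.diag(c))
    return mean_curves * (mu @ (G_c - G) @ nu)

def cumulants_by_differences(Q, mu, nu, mean_curves, h=1e-3):
    """Approximate the gradient and Hessian of log_mgf at c = 0."""
    n = Q.shape[0]
    e = np.eye(n)
    f = lambda c: log_mgf(c, Q, mu, nu, mean_curves)
    mean = np.array([(f(h * e[i]) - f(-h * e[i])) / (2.0 * h) for i in range(n)])
    cov = np.array([[(f(h * (e[i] + e[j])) - f(h * (e[i] - e[j]))
                      - f(h * (e[j] - e[i])) + f(-h * (e[i] + e[j]))) / (4.0 * h * h)
                     for j in range(n)] for i in range(n)])
    return mean, cov

n, lam, mean_curves = 5, 2.0, 3.0
eye = np.eye(n)
L = np.roll(eye, 1, axis=1) + np.roll(eye, -1, axis=1) - 2.0 * eye  # assumed generator
Q = L - eye / lam
mu, nu = np.full(n, 1.0 / n), np.full(n, 1.0 / lam)
G = -np.linalg.inv(Q)

mean_fd, cov_fd = cumulants_by_differences(Q, mu, nu, mean_curves)
mean_formula = np.full(n, mean_curves * lam / n)          # Corollary 1, mean
cov_formula = mean_curves * lam / n * (G + G.T)           # Corollary 1, covariance
```

The closed-form mean relies on $G^*\mathbf{1} = \lambda\mathbf{1}$, which holds here
because the assumed generator is symmetric.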
Several "columns" of the covariance matrix for the curvature process are illustrated
in Fig. 4, by taking its impulse response for several positions, directions
and curvatures. In addition, using Prop. 1 one can compute the higher-order
cumulants of U, and they are generally not zero, which shows that the curve
indicator random field is non-Gaussian [1]. Despite that, its moment generating
functional has a tractable form that we shall directly exploit next.

Fig. 4. Impulse responses of the covariance matrix for the curve indicator random field
for the curvature process. Impulses are located at the center of each image, directed at
discrete direction 4 out of 32, with 5 curvatures. Parameters are
$\sigma_{\dot\kappa} = 0.01$, $\lambda = 10$.

5 Minimum Mean Square Error Filtering 

Instead of the unknown random field U, what we actually observe is some realization
m of a random field M of (edge or line) measurements. Given m, we seek
that approximation $\tilde{u}$ of U that minimizes the mean square error (MMSE), or

$$\tilde{u} := \arg\min_u \mathbb{E}_m\|u - U\|^2,$$

where $\mathbb{E}_m$ denotes taking an expectation conditioned on the measurement
realization m. It is well-known that the posterior mean is the MMSE estimate

$$\tilde{u} = \mathbb{E}_m U,$$

but in many interesting, non-Gaussian, cases this is extremely difficult to compute.
In our context, however, we are fortunate to be able to make use of the
moment generating functional to simplify computations.

Before developing our MMSE estimator, we must define our likelihood function
p(M|U). First let $H_i$ be the binary random variable taking the value 1
if one of the contours passed through (or "hit") site i, and 0 otherwise, and
so H is a binary random field on $\mathcal{I}$. In this paper we consider conditionally
independent, local likelihoods:
$p(M|H) = p(M_1|H_1) \cdots p(M_{|\mathcal{I}|}|H_{|\mathcal{I}|})$. Following
[6,24], we consider two distributions over measurements at site i:
$p_{on}(M_i) := p(M_i|H_i = 1)$ and $p_{off}(M_i) := p(M_i|H_i = 0)$. It follows [24]
that, up to a term that does not depend on H,
$\ln p(M|H) = \sum_i \ln(p_{on}(M_i)/p_{off}(M_i))\,H_i$. Now let $\tau$ be the average
amount of time spent by the Markov processes in a site, given that the site was hit;
observe that $U_i/\tau$ and $H_i$ are therefore equal on average. This suggests that
we replace H with $U/\tau$ above to generate a likelihood in U, in particular,

$$\ln p(M|U) \approx \sum_i c_i U_i = \langle c, U \rangle, \quad \text{where } c_i = c_i(M_i) = \tau^{-1}\ln\frac{p_{on}(M_i)}{p_{off}(M_i)}.$$

As shown in [1], the posterior mean in this case becomes

$$\mathbb{E}_m U_i \approx \frac{\partial}{\partial c_i}\ln\big(\mathbb{E}\exp\langle c, U \rangle\big)(c) = \bar{\mathcal{N}} f_i b_i, \qquad \forall i \in \mathcal{I}, \qquad (5)$$

where $f = (f_1, \ldots, f_{|\mathcal{I}|})$ is the solution to the forward equation

$$(Q + \mathrm{diag}\,c)f + \nu = 0 \qquad (6)$$

and $b = (b_1, \ldots, b_{|\mathcal{I}|})$ is the solution to the backward equation

$$(Q^* + \mathrm{diag}\,c)b + \mu = 0. \qquad (7)$$

Note that equations (6) and (7) are linear systems in $\mathbb{R}^{|\mathcal{I}|}$;
however, since $Q = L - \alpha I$, where the generator L is the discretization of an
(elliptic) partial differential operator, we view the forward and backward equations
as (linear) elliptic partial differential equations, by replacing L with its
undiscretized counterpart, and c, f, b, $\mu$ and $\nu$ with corresponding (possibly
generalized) functions on a continuum, such as $\mathbb{R}^2 \times \mathbb{S}$ for the
direction process, or $\mathbb{R}^2 \times \mathbb{S} \times \mathbb{R}$ for the
curvature process.

Observe that two nonlinearities arise in this posterior mean in equation (5).
First, there is a product of the forward and backward solutions^. Second, although
both the forward and backward equations are linear, they represent nonlinear
mappings from input (c) to output (f or b). For example, it follows that
$f = (I - G\,\mathrm{diag}\,c)^{-1}G\nu = \sum_{k=0}^{\infty}(G\,\mathrm{diag}\,c)^k G\nu$
for sufficiently small c, i.e., f is an (infinite) polynomial, and thus nonlinear,
function of the input c.
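
On a small discrete state space the forward and backward equations (6) and (7) are
ordinary linear systems, so the posterior mean approximation (5) can be sketched
directly with dense linear algebra. The ring generator and the function name
`cirf_posterior_mean` below are assumptions for illustration only; the paper's own
solver for the discretized partial differential operators is described in [1].

```python
import numpy as np

def ring_generator(n, rate=1.0):
    """Hypothetical generator L: a symmetric random walk on a ring of n >= 3 sites."""
    eye = np.eye(n)
    return rate * (np.roll(eye, 1, axis=1) + np.roll(eye, -1, axis=1) - 2.0 * eye)

def cirf_posterior_mean(L, c, mean_lifetime, mean_curves=1.0):
    """Solve (6) and (7) and return (mean_curves * f * b, f, b) as in (5)."""
    n = L.shape[0]
    Q = L - np.eye(n) / mean_lifetime
    mu = np.full(n, 1.0 / n)                   # uniform weighting of start states
    nu = np.full(n, 1.0 / mean_lifetime)       # uniform weighting of end states
    A = Q + np.diag(c)
    f = np.linalg.solve(A, -nu)                # forward equation (6)
    b = np.linalg.solve(A.T, -mu)              # backward equation (7)
    return mean_curves * f * b, f, b

L = ring_generator(6)
c = np.linspace(-0.05, 0.05, 6)                # small input, keeps Q + diag c invertible
u, f, b = cirf_posterior_mean(L, c, mean_lifetime=5.0)
u_no_input, _, _ = cirf_posterior_mean(L, np.zeros(6), mean_lifetime=5.0)
```

With zero input the output reduces to the prior mean of Corollary 1, here
5/6 at every site; a non-zero input reshapes the response through both f and b.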



5.1 Example Computations 

We have implemented a preliminary version of the CIRF posterior mean filter.
For our initial experimentation, we adopted a standard additive white Gaussian
noise model for the likelihood p(M|U). As a consequence, we have
$c(m) = \gamma_1 m - \gamma_2$, a simple transformation of the input m, where
$\gamma_1$ and $\gamma_2 = 3$ are constants. The direction-dependent input m was set
to the result of logical/linear edge and line operators [12]. The output of the
logical/linear operator was linearly interpolated to as many directions as necessary.
For direction-only filtering, this input interpolation was sufficient, but not for
the curvature-based filtering, as curvature was not directly measured in the image;
instead, the directed logical/linear response was simply copied over all curvature
values, i.e., the input m was constant as a function of curvature. The limit of the
invertibility of $Q + \mathrm{diag}\,c$ and its transpose was used to set $\gamma_1$
[1]. The technique used to solve the forward and backward CIRF equations for the
curvature process in $(x, y, \theta, \kappa)$ will be reported in [1], and is a
generalization of the method used for the forward and backward CIRF equations for
the Mumford (direction) process in $(x, y, \theta)$ [1], which is also used here.
Parameter settings were $\bar{\mathcal{N}} = 1$, $\mu = |\mathcal{I}|^{-1}\mathbf{1}$,
$\nu = \lambda^{-1}\mathbf{1}$. The direction-process based CIRF was solved on a grid
the size of the given image but with 32 directions. For the curvature-based CIRF
filtering, very few curvatures were used (3 or 5) in order to keep down computation
times in our unoptimized implementation. Unless we state otherwise, all filtering
responses (which are fields over either discrete $(x, y, \theta)$-space or discrete
$(x, y, \theta, \kappa)$-space) are shown summed over all variables except (x, y).

^ This is analogous to the source/sink product in the stochastic completion field [22].

For our first example, we considered a blood cell image (Fig. 5, top). To illus- 
trate robustness, noise was added to a small portion of the image that contained 
two cells (top left), and was processed with the logical/linear edge operator at 
the default settings^. The result was first filtered using the CIRF posterior mean 
based on Mumford’s direction process (top center). Despite using two very differ- 
ent bounds on curvature, the direction-based filtering cannot close the blood cell 
boundaries appropriately. In contrast, the CIRF posterior mean with the curva- 
ture process (top right) was more effective at forming a complete boundary. To 
illustrate in more detail, we plotted the filter responses for the direction-based
filter at $\sigma_\kappa = 0.025$ for 8 of the 32 discrete directions in the middle of Fig. 5. The
brightness in each of the 8 sub-images is proportional to the response for that par- 
ticular direction as a function of position (x, y). Observe the over-straightening 
effect shown by the elongated responses. The curvature filter responses were
plotted as a function of direction and curvature (bottom). Despite the input having
been constant as a function of curvature, the result shows curvature selectivity.
Indeed, one can clearly see in the $\kappa > 0$ row (Fig. 5, bottom) that the boundary
of the top left blood cell is traced out in a counter-clockwise manner. In the
$\kappa < 0$ row, the same cell is traced out in the opposite manner. (Since the
parameterization of the curve is lost when forming its image, we cannot know which
way the contour was traversed; our result is consistent with both ways.) The 
response for the lower right blood cell was somewhat weaker but qualitatively 
similar. Unlike the direction-only process, the curvature process can effectively 
deal with highly curved contours. 

For our next example, we took two sub-images of a low-contrast angiogram
(top of Fig. 6; sub-images from left and top right of original). The first sub-image
(top left) contained a straight structure, which was enhanced by our curvature-based
CIRF filter (summed responses at top right). The distinct responses at
separate directions and curvatures show curvature selectivity as well, since the
straight curvature at 45° had the greatest response (center). The second sub-image
(bottom left) of a loop structure also produced a reasonable filter response
(bottom right); the individual responses (bottom) also show some curvature
selectivity.

^ Code and settings available at http://www.ai.sri.com/~leei/loglin.html.

Fig. 5. Curvature filtering of a blood cell image (see text). Panels: direction CIRF
output by direction; curvature CIRF output by direction and curvature
($\theta$: 0°, 90°, 180°, 270°).

As argued in the Introduction, the bug with a direction-only search cone
would mistrack on a contour as curvature builds up. To make this point
computationally, we consider an image (top left of Fig. 7) of an Euler spiral extending
from a straight line segment^. Observe that the contour curvature begins at zero
(straight segment) and then builds up gradually. To produce a 3-dimensional
input to our direction-based filter, this original (2-d) image was copied to all
directions (i.e., $m(x, y, \theta) = \mathrm{image}(x, y)$, for all $\theta$). Similarly,
the image was copied to all directions and curvatures to produce a 4-d input to the
curvature-based filter. For this test only, our 2-dimensional outputs were produced by
taking, at each position (x, y), the maximum response over all directions (for the
direction-based filtering) or over all directions and curvatures (for the
curvature-based filtering). The direction-based CIRF posterior mean (with parameters
$\sigma_\kappa = 0.025$, $\lambda = 10$, and 64 directions) was computed (center),
showing an undesirable reduction in response as curvature built up. The curvature-based
CIRF posterior mean (right, with parameters $\sigma_{\dot\kappa} = 0.05$,
$\lambda = 10$, 64 directions, and 7 curvatures (0, ±0.05, ±0.1, ±0.15)) shows strong
response even at the higher curvature portions of the contour. To test robustness,
0-mean Gaussian noise of standard deviation 0.4 was added (bottom left) to the image
(0 to 1 was the signal range before adding noise). The results (bottom center and
right) show that the curvature-based filter performs better in high curvature regions
despite noise.

^ We used formula (16.7) of Kimia et al [14], and created the plot in Mathematica
with all parameters 0, except $\gamma = 0.1$ (Kimia et al's notation). The resulting
plot was grabbed, combined with a line segment, blurred with a Gaussian, and then
subsampled.

Fig. 6. Curvature filtering for an ai…

Fig. 7. Filtering an Euler spiral without noise (top) and with noise (bottom). The
original images are on the left; the result after filtering using the curve indicator
random field based on Mumford's direction-based Markov process (center) and our
curvature-based Markov process (right). Notice that the direction CIRF result tends to
repress the signal at high curvatures, while the curvature process has more consistent
performance, even at higher curvatures. See text for details.

Computations were conveniently performed using the Python scripting lan- 
guage with numerical extensions in a GNU/Linux environment. 
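
The input lifting and output projection used in the Euler spiral experiment are
simple array operations. The sketch below shows one way they might be written with
NumPy; the function names are hypothetical and this is not the original code of the
paper, only an illustration of copying a 2-d image over directions (and curvatures)
and of reducing a filter response back to 2-d by a maximum or a sum.

```python
import numpy as np

def lift_to_directions(image, n_directions):
    """m(x, y, theta) = image(x, y) for every discrete direction theta."""
    return np.repeat(image[:, :, None], n_directions, axis=2)

def lift_to_directions_and_curvatures(image, n_directions, n_curvatures):
    """m(x, y, theta, kappa) = image(x, y) for every direction and curvature."""
    shape = image.shape + (n_directions, n_curvatures)
    return np.broadcast_to(image[:, :, None, None], shape).copy()

def project_max(response):
    """Maximum over all variables except (x, y), as used for the spiral test."""
    flat = response.reshape(response.shape[0], response.shape[1], -1)
    return flat.max(axis=2)

def project_sum(response):
    """Sum over all variables except (x, y), the default display in the paper."""
    flat = response.reshape(response.shape[0], response.shape[1], -1)
    return flat.sum(axis=2)

image = np.arange(6.0).reshape(2, 3)
m3 = lift_to_directions(image, 64)                       # 3-d input
m4 = lift_to_directions_and_curvatures(image, 64, 7)     # 4-d input
```

Projecting the lifted input with `project_max` returns the original image, since
every direction and curvature slice is an identical copy.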

6 Conclusion 

In this paper we introduced a new stochastic model for contour curvature to 
more faithfully capture the shape of image curves. Whereas most contour models 
penalize large curvatures, our curvature Markov process allows highly curving
contours, and only penalizes changes in curvature. The curvature process can 
be used directly in the curve indicator random field [2,1] to construct a prior 
for curve images. To enhance noisy images of contours, we present a nonlinear 
filter (details in [1]) that approximates the posterior mean of the curve indicator 
random field. Our initial computations show that the filter responds well along 
smooth contours, even those having large curvature. 
Abstract. We propose a simple and efficient method to interpolate
landmark matching by a non-ambiguous mapping (a diffeomorphism).
This method is based on spline interpolation, and on recent techniques
developed for the estimation of flows of diffeomorphisms. Experimental
results show interpolations of remarkable quality. Moreover, the method
provides a Riemannian distance on sets of landmarks (with fixed
cardinality), which can be defined intrinsically, without referring to
diffeomorphisms. The numerical implementation is simple and efficient, based
on an energy minimization by gradient descent. This opens important
perspectives for shape analysis, applications in medical imaging, or
computer graphics.



1 Introduction 

This paper proposes a new, efficient and consistent method for generating dense
diffeomorphisms within an image from sparse information on the displacements
of a finite number of points (landmarks). This is an important issue for image
processing and computer graphics, and the problem has generated a large number
of publications, starting with the seminal papers of Bookstein (see [3] and
references therein). There are numerous applications: generating deformations from
the position of control points is used, for example, to synthesize facial
expressions, or to compute morphings; analyzing variations of shape has applications in
medical imaging or face recognition; matching is essential for the construction
of anatomical atlases. Together with the problem of interpolating from landmark
matching comes the issue of measuring the discrepancy between two groups of
matched landmarks. This is not an obvious problem, and it seems quite intuitive
that the smoothness of the underlying, unobserved, global displacement is
an essential part of the perceived discrepancy. A third, important, feature is
the consistency of the interpolated displacement, in the sense
that it should be one-to-one, ensuring that there cannot be two distinct parts of
the original picture which are matched to the same zone in the target.

The method which is described here addresses the three problems simultaneously.
It does provide a way to interpolate from landmark-matching, while
providing a distance between configurations of landmarks which takes into account
the smoothness of the underlying warping, generated as a diffeomorphism
defined on the image grid. A fourth, non-negligible property is
that this method is easy to implement, and numerically efficient.

To fix notations, let $\Omega$ be a bounded set in the plane. When $(x_1, \ldots, x_N)$
and $(y_1, \ldots, y_N)$, two sets of N labeled landmarks in $\Omega$, are given, we
shall deal with the problem of finding a diffeomorphism g of $\Omega$, with minimal
size (in a sense to be defined), such that, for all i, $g(x_i) \simeq y_i$ (inexact
matching).

The method which is developed in the sequel takes its roots from three main 
ideas: 

— Interpolating splines, as pioneered by Bookstein in computer vision ([3]), 
and widely used to generate dense warpings from sparse information. 

— Generation of diffeomorphisms as flows (solutions of an ODE) , in a frame- 
work which guarantees smoothness and consistency, as in [10,4] 

— Computation of geodesic distances (minimal path length) on deformable 
data, as used in [11,8]. 

The analysis results in a simple algorithm to compute diffeomorphisms from 
landmark data. 

The paper is organized as follows. We start by reviewing the elements of 
spline theory which will be needed, and relate them to the (non-diffeomorphic) 
interpolation introduced by Bookstein. This forms the first ingredient for our 
method. In a second step, we give a presentation of the theory of groups of
diffeomorphisms, generated as flows (solutions of ODEs) on a set $\Omega$, and show how
this framework can be used to generate geodesic distances on structures acted
on by diffeomorphisms (i.e., deformable patterns). The last step will be to use this
framework on the very simple deformable structures that are sets of landmarks,
to derive an efficient algorithm for simultaneously computing distances between 
sets of landmarks and generating a diffeomorphism to interpolate the pointwise 
matching. The paper ends with a presentation of some experimental facts and 
data. 

2 Landmark matching and splines 

2.1 Splines 

As for all landmark-matching methods, the numerical efficiency of the algorithm
that we propose relies on spline interpolation theory. For completeness
of the presentation, we spend some time in describing the foundations of this 
theory, exhibiting in particular its remarkable algebraic simplicity. 

Formally speaking, spline fitting can be considered a particular case of what
follows. Let $\mathcal{H}$ be a Hilbert space, let $f_1, \ldots, f_N \in \mathcal{H}$,
and $c_1, \ldots, c_N \in \mathbb{R}$ be given. Denote by $\langle \cdot , \cdot \rangle$
the inner product on $\mathcal{H}$. Consider the following problems:

1. Find $h \in \mathcal{H}$ such that $\|h\|$ is minimum subject to the constraints
$\langle f_i , h \rangle = c_i$ for $i = 1, \ldots, N$.

2. Fixing $\lambda > 0$, find h such that
$\|h\|^2 + \lambda \sum_{i=1}^{N} (\langle f_i , h \rangle - c_i)^2$ is minimum.

The first problem corresponds to interpolation, or exact matching, the second
one to smoothing, or approximate matching, and both are solved by elementary
linear algebra. It is indeed clear that, in both cases, the constraints are not
affected if h is replaced by h + v where v is orthogonal to all the $f_i$, so that the
solution must in fact be searched in the linear space spanned by $f_1, \ldots, f_N$:
so, introduce the N x N matrix S with $S_{ij} = \langle f_i , f_j \rangle$, and express
the unknown h as a linear combination $h = \sum_{i=1}^{N} \alpha_i f_i$.

Problem 1 now requires to minimize ${}^t\alpha S \alpha$ subject to the constraint
$S\alpha = c$ (where $\alpha$ and c are vectors with components $\alpha_i$ and $c_i$
respectively), and problem 2 to minimize
${}^t\alpha S \alpha + \lambda\,{}^t(S\alpha - c)(S\alpha - c)$.

Assume that S is invertible, so that no linear constraint can be deduced
one from another¹. The solution of problem 1 is in fact uniquely specified by
the constraint; it is $\alpha = S^{-1}c$. For the second problem, routine computations
show that it is $\alpha = S_\lambda^{-1}c$ with $S_\lambda = S + I/\lambda$ (the
invertibility of S is here not required).
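
Both solutions can be checked on a finite-dimensional example, taking the Hilbert
space to be $\mathbb{R}^d$ with the usual dot product. This is only a sketch under
that assumption; the names `solve_interpolation` and `solve_smoothing`, and the
random test vectors, are illustrative and not part of the paper.

```python
import numpy as np

def solve_interpolation(F, c):
    """Problem 1: rows of F are the f_i; returns (alpha, h) with alpha = S^{-1} c."""
    S = F @ F.T                                   # Gram matrix S_ij = <f_i, f_j>
    alpha = np.linalg.solve(S, c)
    return alpha, F.T @ alpha                     # h = sum_i alpha_i f_i

def solve_smoothing(F, c, lam):
    """Problem 2: alpha = (S + I / lam)^{-1} c."""
    S = F @ F.T
    alpha = np.linalg.solve(S + np.eye(len(c)) / lam, c)
    return alpha, F.T @ alpha

rng = np.random.default_rng(0)
F = rng.normal(size=(3, 6))                       # N = 3 constraints in R^6
c = np.array([1.0, -2.0, 0.5])

alpha_i, h_i = solve_interpolation(F, c)
alpha_s, h_s = solve_smoothing(F, c, lam=4.0)

# Independent references: minimum-norm solution and the normal equations
h_min_norm = np.linalg.pinv(F) @ c
h_normal = np.linalg.solve(np.eye(6) + 4.0 * F.T @ F, 4.0 * F.T @ c)
```

The interpolating solution coincides with the minimum-norm solution of the linear
constraints, and the smoothing solution with the minimizer obtained from the normal
equations in $\mathbb{R}^d$, which is what the reduction to the span of the $f_i$
predicts.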

Let us see how this applies to splines. In this context, a set of points
$(x_1, \ldots, x_N)$ in $\Omega$ is given, together with real numbers
$c_1, \ldots, c_N$, and spline interpolation corresponds to finding a real-valued
function h (defined on $\Omega$), as smooth as possible, such that $h(x_i) = c_i$ or
$h(x_i) \simeq c_i$. The smoothness of h is evaluated through a norm of the kind

$$\|h\|_L^2 = \int_\Omega |Lh|^2\,dx$$

where L is, say, a differential operator. This norm defines a Hilbert space of
functions $\mathcal{H}_L$, with inner-product

$$\langle h , g \rangle_L = \int_\Omega Lh\,Lg\,dx.$$

The constraints $h(x_i) = c_i$ are linear in h, and the issue, to fit in the previous
abstract setting, is whether there exists an element $f_{x_i}$ in $\mathcal{H}_L$ such
that, for all $h \in \mathcal{H}_L$, $h(x_i) = \langle f_{x_i} , h \rangle_L$. If this
can be done², the solution of the interpolation problem is given by a linear
combination of the $f_{x_i}$, the coefficients of which are simply obtained by applying
the inverse of the matrix of inner-products of the $f_{x_i}$ to the values of the
constraints $c_i$. A similar conclusion can be drawn if we replace this exact
interpolation problem by an inexact form, which consists in minimizing

$$\|h\|_L^2 + \lambda \sum_{i=1}^{N} (h(x_i) - c_i)^2.$$

¹ Problem 1 may have no solution when S is not invertible.

² This is equivalent, by the Riesz representation theorem, to the continuity of the
evaluation mapping $h \mapsto h(x)$ for the norm $\|\cdot\|_L$.
It must be noted that the inner-products $\langle f_{x_i} , f_{x_j} \rangle_L$ are, by
construction, given by $f_{x_i}(x_j)$ (f is self-reproducing), so that their
computation is immediate.

So everything depends on the existence of $f_x$. Theoretical arguments for
proving this existence can be given (linked, for example, to Sobolev's inclusion
theorems when L is a differential operator) but, for practical purposes, it is also
necessary to have an analytical and computable expression for them. The functions
f are obtained from the Green kernel of the operator $L^*L$.³ So, practical
applicability of spline interpolation depends on whether the Green kernel of
$K = L^*L$ is known or not, provided of course it exists at all.

Well-known cases in which explicit expressions for Green functions are available
are when L is a variant of the Laplacian (with simple enough boundary
conditions). However, this can be too specific in some applications, and noting
that the method never requires explicit knowledge of L, it is possible to start
directly with a function $(x, y) \mapsto f_x(y)$ provided that one knows that f is the
Green kernel of some positive operator K (not necessarily a differential operator),
which need not be specified. Non-constructive existence assumptions exist;
for example one may require that f is symmetric ($f_x(y) = f_y(x)$), continuous,
square-integrable and induces a positive operator on $L^2$, the last requirement
being satisfied when $f_x(y) = F(x - y)$, F being the Fourier transform of a
positive, even function.

In [2], [1], these requirements are further specialized to ensure that f is a radial
basis function: $f_x(y) = G(|x - y|)$, the simplest example being the Gaussian
$G(t) = \exp(-t^2/(2\sigma^2))$ for some width $\sigma > 0$. This additional assumption
has the advantage of providing rotation invariant interpolation (which is also true
when L is a linear combination of powers of the Laplacian).
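
A minimal sketch of interpolation with a Gaussian radial basis function follows.
The kernel width `sigma`, its parametrization, and the helper names are assumptions
made for this example. Because the kernel is self-reproducing, the Gram matrix is
simply the kernel evaluated at pairs of landmarks, and rotating the whole
configuration leaves the coefficients unchanged.

```python
import numpy as np

def gaussian_kernel(a, b, sigma):
    """Matrix of G(|a_i - b_j|) for the Gaussian G(t) = exp(-t^2 / (2 sigma^2))."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def fit_rbf(x, c, sigma, lam=None):
    """Coefficients alpha = S^{-1} c (exact) or (S + I / lam)^{-1} c (smoothing)."""
    S = gaussian_kernel(x, x, sigma)
    if lam is not None:
        S = S + np.eye(len(x)) / lam
    return np.linalg.solve(S, c)

def eval_rbf(x, alpha, sigma, pts):
    """h(p) = sum_i alpha_i G(|p - x_i|) at each row p of pts."""
    return gaussian_kernel(pts, x, sigma) @ alpha

x = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 1.0]])
c = np.array([1.0, -1.0, 0.5, 2.0])
alpha = fit_rbf(x, c, sigma=1.0)

# Rotate landmarks and query points by the same angle
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
query = np.array([[0.4, 0.3], [1.5, 0.8]])
alpha_rot = fit_rbf(x @ R.T, c, sigma=1.0)
```

Evaluating the rotated interpolant at the rotated query points gives the same values
as the original interpolant at the original points, which is the rotation
invariance mentioned above.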

2.2 Landmark matching and Bookstein’s splines 

For diffeomorphic matching, it must be taken into account, to apply the previous,
that the unknown function is vector valued. In fact, if $x_1, \ldots, x_N$ and
$y_1, \ldots, y_N$ are two matched sets of landmarks, one should find a diffeomorphism
$h : \Omega \to \Omega$ such that $h(x_i) = y_i$ (equivalently, one searches a
displacement $u = h - \mathrm{id}$ such that $u(x_i) = y_i - x_i$).

Bookstein (see [3]) proposes to apply spline interpolation to each component 
of u. This is the simplest approach, and we will keep to this setting in this 

³ Let $L^*$ be the dual operator of L, which is such that, for all g and h with compact
support in $\Omega$,

$$\int_\Omega (Lh)\,g = \int_\Omega (L^*g)\,h$$

and let $K = L^*L$. For all x, one has

$$h(x) = \langle f_x , h \rangle_L = \int_\Omega Lf_x\,Lh\,dy = \int_\Omega f_x\,Kh\,dy$$

which is precisely the definition of the fact that the function
$(x, y) \mapsto f_x(y)$ is the Green kernel of K. (K is also affected by boundary
conditions, but we omit the complications here).
paper. But it must be remarked that this is not a general point of view, the
Green function being, in this context, naturally expressed as a matrix of kernels.
Returning to Bookstein splines, consider the inner-product

$$\langle h , g \rangle = \int_\Omega \Delta h\,\Delta g\,dx, \qquad (1)$$

$\mathcal{H}$ being the space of functions with square integrable second derivatives (a
Beppo-Levi space). This space is not, strictly speaking, a Hilbert space, because
the inner product is degenerate (it vanishes for affine functions), but can be
considered as one provided one considers that functions which only differ by an affine
function are equal.

With this in mind, let $U(r) = r^2 \log r$: this function is such that, for any
smooth function h

$$h(x) = \int U(|x - y|)\,\Delta^2 h(y)\,dy$$

where $\Delta^2$ is the iterated Laplacian, and equality being up to the addition of an
affine function (cf [7]). Letting $f_x(y) = U(|x - y|)$, this is
$\langle f_x , h \rangle = h(x)$ (up to an affine function). So let $c_i$ be one of the
components of $y_i - x_i$, so that the constraints $h(x_i) = c_i$ for i = 1 to N write:
there exist $a = (a^1, a^2) \in \mathbb{R}^2$, $b \in \mathbb{R}$ such that, for all i:
$\langle h , f_{x_i} \rangle = c_i - {}^ta\,x_i - b$.

Let $S_{ij} = \langle f_{x_i} , f_{x_j} \rangle = U(|x_i - x_j|)$. The interpolation
problem becomes: minimize ${}^t\alpha S \alpha$ with the constraint
$S\alpha + Q\gamma = c$ where $\gamma = {}^t(a^1, a^2, b)$ is a 3 x 1 matrix and Q is
a N x 3 matrix, given by, letting $x_i = (x_i^1, x_i^2)$:

$$Q = \begin{pmatrix} x_1^1 & x_1^2 & 1 \\ \vdots & \vdots & \vdots \\ x_N^1 & x_N^2 & 1 \end{pmatrix};$$

solving this problem in $(\alpha, \gamma)$ yields
$\gamma = ({}^tQ S^{-1} Q)^{-1}\,{}^tQ S^{-1} c$ and $\alpha = S^{-1}(c - Q\gamma)$.
The smoothing problem requires minimizing

$${}^t\alpha S \alpha + \lambda\,{}^t(S\alpha + Q\gamma - c)(S\alpha + Q\gamma - c), \qquad (Q\gamma)_i = {}^ta\,x_i + b,$$

and its solution is formally similar to the previous one, simply replacing S by
$S_\lambda = S + (1/\lambda)I$ in the formulas.
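
The computation for one displacement component can be sketched as follows. Instead
of the closed-form expressions for $\gamma$ and $\alpha$, this example solves the
equivalent block linear system in $(\alpha, \gamma)$, which encodes both
$S\alpha + Q\gamma = c$ and ${}^tQ\alpha = 0$; that choice, the function names and
the sample landmarks are assumptions of this illustration.

```python
import numpy as np

def tps_kernel(r):
    """U(r) = r^2 log r, with U(0) = 0 by continuity."""
    out = np.zeros_like(r, dtype=float)
    mask = r > 0
    out[mask] = r[mask] ** 2 * np.log(r[mask])
    return out

def fit_tps(x, c, lam=None):
    """Return (alpha, gamma) for landmarks x (N x 2) and one component c (N,)."""
    n = len(x)
    r = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    S = tps_kernel(r)
    if lam is not None:
        S = S + np.eye(n) / lam                   # smoothing: S replaced by S_lambda
    Q = np.hstack([x, np.ones((n, 1))])           # rows (x_i^1, x_i^2, 1)
    A = np.block([[S, Q], [Q.T, np.zeros((3, 3))]])
    sol = np.linalg.solve(A, np.concatenate([c, np.zeros(3)]))
    return sol[:n], sol[n:]

def eval_tps(x, alpha, gamma, pts):
    """Spline plus affine part evaluated at each row of pts."""
    r = np.linalg.norm(pts[:, None, :] - x[None, :, :], axis=-1)
    return tps_kernel(r) @ alpha + pts @ gamma[:2] + gamma[2]

x = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.3, 0.6]])
c = np.array([0.1, -0.2, 0.0, 0.3, 0.5])          # one component of y_i - x_i
alpha, gamma = fit_tps(x, c)

# A purely affine displacement should need no bending at all
c_affine = 2.0 * x[:, 0] - x[:, 1] + 0.5
alpha_aff, gamma_aff = fit_tps(x, c_affine)
```

Applying `fit_tps` once per coordinate of $y_i - x_i$ gives the displacement u; as
the text notes next, nothing in this construction prevents x + u(x) from folding.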

When this is applied to both components of yi — Xi, one obtains a function 
u such that u(xi) = yi — Xi for exact matching, which thus provides a smooth 
interpolation of the landmark correspondence. However, there is no constraint
in this approach that ensures that $h(x) = x + u(x)$ is one-to-one: folding is
indeed possible, and examples will be given in the last section of this paper. We 
shall obtain a rigorous one-to-one matching using flows of diffeomorphisms, as 
introduced in the next section. 
3 Diffeomorphic landmark matching

One way to introduce the groups of diffeomorphisms we shall be dealing with 
starts from standard methods for generating metrics and distances on sets acted 
on by groups. We address this in the next section. 

3.1 Distances and group actions 

General facts We start with a short algebraic section in which basic facts
on how to induce a distance from a group action are recalled, introducing in
particular a "least-action principle". First recall that a distance on a set
$\mathcal{I}$ is a mapping $d : \mathcal{I}^2 \to \mathbb{R}_+$ such that, for all
$I, I', I'' \in \mathcal{I}$: D1) $d(I, I') = 0 \Leftrightarrow I = I'$,
D2) $d(I, I') = d(I', I)$, D3) $d(I, I'') \le d(I, I') + d(I', I'')$.

If D1) is not true and $d(I, I) = 0$ for all I, we use the term pseudo-distance.
A group G is acting on $\mathcal{I}$ if an operation $(g, I) \mapsto g.I$ is defined on
$G \times \mathcal{I}$ with values in $\mathcal{I}$ such that $\mathrm{id}_G.I = I$ and
$g.(h.I) = (gh).I$ for all $I \in \mathcal{I}$ and all $g, h \in G$. If G is a group
acting on $\mathcal{I}$, one says that a distance d on $\mathcal{I}$ is
G-equivariant if and only if, for all $g \in G$, for all $I, I' \in \mathcal{I}$,
$d(g.I, g.I') = d(I, I')$.

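We shall be dealing with the following construction. Let G act on $\mathcal{I}$, and
consider the product $\mathcal{O} = G \times \mathcal{I}$, so that G also acts on
$\mathcal{O}$ (simply letting, for $k \in G$, $o = (h, I) \in \mathcal{O}$:
$k.o = (kh, k.I)$).

For $o = (h, I) \in \mathcal{O}$, let $\pi(o) = h^{-1}.I$. Assume that
$d_{\mathcal{O}}$ is a distance on $\mathcal{O}$, and let, for $I, I' \in \mathcal{I}$

$$d(I, I') = \inf\{d_{\mathcal{O}}(o, o'),\ o, o' \in \mathcal{O},\ \pi(o) = I,\ \pi(o') = I'\} \qquad (2)$$

Proposition 1. If $d_{\mathcal{O}}$ is G-equivariant, then d in (2) is a
pseudo-distance on $\mathcal{I}$.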

We shall use these results for landmarks, letting diffeomorphisms act on them.
This however raises the problem of building an invariant distance on $\mathcal{O}$; one
simple approach for this is to use geodesics in this space, as presented in the next
section.



Infinitesimal approach A standard way for building distances on sets like
$\mathcal{O}$ is to compute shortest paths. Assume that we are able to give a meaning
to the speed $V_{\mathbf{o}}(t) = \frac{d\mathbf{o}}{dt}$ of a path
$\mathbf{o} : t \mapsto \mathbf{o}(t)$ on $\mathcal{O}$. Assume also that, for each
$o \in \mathcal{O}$, we have a way to quantify the speeds of paths passing through o
with the help of a norm $V \mapsto \|V\|_o$ (the norm depends on o). Then, let the
associated path energy be given by

$$E(\mathbf{o}) = \int_0^1 \|V_{\mathbf{o}}(t)\|_{\mathbf{o}(t)}^2\,dt \qquad (3)$$

and the geodesic distance on $\mathcal{O}$ be then defined by

$$d_{\mathcal{O}}(o, o') = \inf\{\sqrt{E(\mathbf{o})},\ \mathbf{o}(0) = o,\ \mathbf{o}(1) = o'\}. \qquad (4)$$

To build a G-equivariant distance (as required by proposition 1), it suffices
to start with a family of norms ($\|\cdot\|_o$, $o \in \mathcal{O}$) which shares this
property, in the sense that, if $\mathbf{o}$ is a path on $\mathcal{O}$ and $h \in G$,
then the translated path $h.\mathbf{o}$ and $\mathbf{o}$ both have the same speeds at
the same times: this writes

$$\|V_{h.\mathbf{o}}(t)\|_{h.\mathbf{o}(t)} = \|V_{\mathbf{o}}(t)\|_{\mathbf{o}(t)} \qquad (5)$$

One can interpret this formula with the help of the differential of the action of
G on $\mathcal{O}$, but the meaning and the consequences of (5) will be easily derived,
and we will not need to introduce the usual machinery of differential geometry.

It is important to notice that this condition provides norms of velocities at
translated objects h.o as soon as these norms are known at o. Since
$\mathcal{O} = G \times \mathcal{I}$, it thus suffices to define $\|\cdot\|_o$ for
$o \in \mathcal{O}$ of the kind $o = (\mathrm{id}_G, I)$.



3.2 Mixing deformations and object variations: landmark matching 

We specialize the point of view of section 3.1, by letting G be a group of
diffeomorphisms of $\Omega$ and $\mathcal{I}$ be the set of all collections of N
landmarks on $\Omega$. An element of $\mathcal{I}$ is thus an N-tuple
$I = (p_1, \ldots, p_N) \in \Omega^N$. We use on G the product $gh = h \circ g$ and
define the action of G on $\mathcal{I}$ to be

$$g.I = (g^{-1}(p_1), \ldots, g^{-1}(p_N))$$

which does provide a left-action: $(gh).I = g.(h.I)$.

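Following the lines of section 3.1, we consider paths on $\mathcal{O}$. Such a path
takes the form $\mathbf{o}(t) = (g(t, \cdot), p_1(t), \ldots, p_N(t))$ where
$g(t, \cdot)$ is a time dependent diffeomorphism and $p_i(t)$ is a curve in $\Omega$
for $i = 1, \ldots, N$ (landmark trajectory). The velocity at $\mathbf{o}(t)$ is

$$V_{\mathbf{o}}(t) = \Big(V_g(t), \frac{dp_1}{dt}(t), \ldots, \frac{dp_N}{dt}(t)\Big)$$

with $V_g(t) = \frac{\partial g}{\partial t}$.

Let h be a diffeomorphism; the path $h.\mathbf{o}$ is given by

$$h.\mathbf{o}(t) = (g(t, h(\cdot)), h^{-1}(p_1(t)), \ldots, h^{-1}(p_N(t)))$$

and its speed is

$$V_{h.\mathbf{o}}(t) = \Big(V_g \circ h(t), D_{p_1(t)}h^{-1}\,\frac{dp_1}{dt}(t), \ldots, D_{p_N(t)}h^{-1}\,\frac{dp_N}{dt}(t)\Big),$$

where $D_p h^{-1}$ is the differential of $h^{-1}(\cdot)$ with respect to spatial
coordinates, evaluated at a point $p \in \Omega$. Equation (5) requires that

$$\|V_{h.\mathbf{o}}(t)\|_{h.\mathbf{o}(t)} = \|V_{\mathbf{o}}(t)\|_{\mathbf{o}(t)}.$$

This is true in particular when h is the inverse of $g(t, \cdot)$, yielding

$$\|V_{\mathbf{o}}(t)\|_{\mathbf{o}(t)} = \|V_{g^{-1}.\mathbf{o}}(t)\|_{g^{-1}.\mathbf{o}(t)}$$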
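so that it is only necessary to define norms at elements
$o = (\mathrm{id}, p_1, \ldots, p_N)$. We have,

$$V_{g^{-1}.\mathbf{o}}(t) = \Big(v_g(t), D_{p_1(t)}g(t, \cdot)\,\frac{dp_1}{dt}(t), \ldots, D_{p_N(t)}g(t, \cdot)\,\frac{dp_N}{dt}(t)\Big)$$

in which we have let $v_g(t, y) = V_g(t, g^{-1}(t, y))$. Making the change of variables
$q_i(t) = g(t, p_i(t))$, this writes

$$V_{g^{-1}.\mathbf{o}}(t) = \Big(v_g(t), \frac{dq_1}{dt}(t) - v_g(t, q_1(t)), \ldots, \frac{dq_N}{dt}(t) - v_g(t, q_N(t))\Big)$$

and the energy of the path $\mathbf{o}(t)$ takes the form

$$E(\mathbf{o}) = \int_0^1 \Big\|\Big(v_g(t), \frac{dq_1}{dt}(t) - v_g(t, q_1(t)), \ldots, \frac{dq_N}{dt}(t) - v_g(t, q_N(t))\Big)\Big\|^2 dt$$

with a certain norm, which may depend on the current position
$(q_1(t), \ldots, q_N(t))$.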

The essential trick, for building diffeomorphisms (in theory and in practice), 
is to replace the unknown time-dependent diffeomorphism $g$ by its so-called 
Eulerian velocity $v_g$ on which everything now depends. Noting that, by definition, 

    $\frac{\partial g}{\partial t} = v_g(t, g(t))$

knowing $v$ allows one to compute $g$ by integration of an ODE, providing in that way 
a flow of diffeomorphisms (under smoothness conditions on $v$ which have been 
studied in detail in [10] and [4]).
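
As an illustration only, the integration of this ODE can be sketched numerically. The forward Euler scheme, the function names and the rotation field below are assumptions chosen for the example; they are not taken from the numerical method of section 4.

```python
import numpy as np

def integrate_flow(v, x0, n_steps=1000):
    """Forward-Euler integration of dg/dt = v(t, g), g(0) = x0, for t in [0, 1]."""
    g = np.array(x0, dtype=float)
    dt = 1.0 / n_steps
    for step in range(n_steps):
        g = g + dt * v(step * dt, g)
    return g

def rotation_field(t, x):
    """Hypothetical velocity field: rigid rotation about the origin, angular speed pi/2."""
    omega = np.pi / 2
    return omega * np.stack([-x[..., 1], x[..., 0]], axis=-1)

# Two sample points transported by the flow up to time 1 (a quarter turn).
points = np.array([[1.0, 0.0], [0.0, 2.0]])
flowed = integrate_flow(rotation_field, points, n_steps=20000)
```

With this field the flow at time 1 is a quarter turn, so the two points should end close to (0, 1) and (-2, 0), up to the discretization error of the Euler scheme.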

So, we now think in terms of $v$ rather than of $g$. To compute the distance between 
two elements of $O$, it suffices to minimize the energy of the paths which link them. 
We now restrict to the particular case when 

    $E(o) = \int_0^1 \int_\Omega |L v_g(t)|^2 \, dx \, dt + \lambda \sum_{i=1}^N \int_0^1 \left| \frac{dq_i}{dt}(t) - v_g(t, q_i(t)) \right|^2 dt$

$L$ being some operator acting on $v(t)$ for all $t$. We are interested in the 
distance between two sets of landmarks $I$ and $I'$, which is given, according to 
(2), by 

    $d(I, I') = \inf\{ d_O(o, o') : o, o' \in O, \pi(o) = I, \pi(o') = I' \}$

where $\pi(g, p_1, \ldots, p_N) = g^{-1}.(p_1, \ldots, p_N) = (g(p_1), \ldots, g(p_N))$. It is not difficult 
to check that in fact, $d(I, I')$ is the infimum of 

    $\int_0^1 \int_\Omega |L v(t)|^2 \, dx \, dt + \lambda \sum_{i=1}^N \int_0^1 \left| \frac{dq_i}{dt}(t) - v(t, q_i(t)) \right|^2 dt$    (6)

over all time dependent velocities $v$ on $\Omega$, and over all curves $q_1(.), \ldots, q_N(.)$ 
such that $I = (q_1(0), \ldots, q_N(0))$ and $I' = (q_1(1), \ldots, q_N(1))$. We thus obtain 
a new landmark-based matching formula, which only involves the velocity, and 
which, at the same time, provides a distance between sets of landmarks. For 
fixed trajectories $q_i(.)$, $i = 1, \ldots, N$, the optimal $v$ can be explicitly computed 
at each time $t$, as a function of the Green kernel of $L^*L$, as developed in section 
2.1; this is the basis of the numerical algorithm which is detailed in section 4.

The final form of the energy is somewhat reminiscent of S. Joshi’s landmark 
matching method, which is also based on flows of diffeomorphisms ([6]). The main 
difference comes from the fact that Joshi’s method does not optimize landmark 
trajectories, but rather uses an end-point matching penalty which leads to an 
optimal control formulation. There are two consequences of this. The first one is 
that Joshi’s method does not provide a metric between landmark configurations, whereas 
the method derived here has this feature as its initial motivation. The 
second one is that the numerical problem in our case is much simpler, as will be 
seen in section 4.



3.3 A Riemannian metric on deformable landmark configurations 

An interesting feature of the previous construction can be pointed out here: 
minimizing (6) with boundary conditions only in $q$ is equivalent to computing 
a geodesic path for a specific metric on the set of configurations of $N$ landmarks. 
Indeed, for $I = (q_1, \ldots, q_N) \in \Omega^N$, and for $h = (h_1, \ldots, h_N) \in (\mathbb{R}^2)^N$, define 

    $\|h\|_I^2 = \min_v \left( \int_\Omega |Lv|^2 \, dx + \lambda \sum_{i=1}^N |h_i - v(q_i)|^2 \right)$

It is easy to prove that $\|h\|_I$ is a norm as a function of $h$, which therefore 
provides a Riemannian metric on $\Omega^N$, and to check that the optimal trajectories 
$I(.) = (q_1(.), \ldots, q_N(.))$ minimize the energy 

    $\int_0^1 \left\| \frac{dI}{dt}(t) \right\|_{I(t)}^2 dt$

with fixed boundary conditions at time 0 and at time 1, yielding the fact that 
they are geodesics. Noting, moreover, that, for a given $h$, the minimizing $v$ in the 
definition of $\|h\|_I$ can be explicitly computed, provided that the Green function 
of $L^*L$ is known (cf. section 2.1), we finally obtain an explicit view of $\Omega^N$ as a 
Riemannian manifold, in which diffeomorphisms are now only implicit.



4 Experiments 

4.1 Implementation details 

The simplest numerical scheme is not to work directly with the geodesic energy 
of the previous section. We give ourselves a Green function, denoted $f$, as in 
section 2.1, associated to some operator $L$ we do not need to compute. For each 
$t$, the optimal $v$ must have the form 

    $v(t, x) = \sum_{k=1}^N a_k(t) f(q_k(t), x)$

where for each $k$ and $t$, $a_k(t)$ is a 2D vector. The energy in (6) reads, as a 
function of $a$ and $q$: 

    $E(a, q) = \sum_{k,l=1}^N \int_0^1 \langle a_k(t), a_l(t) \rangle f(q_k(t), q_l(t)) \, dt + \lambda \sum_{k=1}^N \int_0^1 \left| \frac{dq_k}{dt}(t) - \sum_{l=1}^N a_l(t) f(q_l(t), q_k(t)) \right|^2 dt$    (7)



in which we use norms and inner products in $\mathbb{R}^2$. Now, for fixed $q$, the optimal $a$ 
is explicit. Let $S(t)$ be the $N \times N$ matrix with coefficients $f(q_i(t), q_k(t))$, and let 
$S_\lambda(t) = S(t) + I/\lambda$. Letting $a^i = (a^i_1, \ldots, a^i_N)$ and $q^i = (q^i_1, \ldots, q^i_N)$, $i = 1, 2$, 
we have, for all $t$, 

    $a^i(t) = [S_\lambda(t)]^{-1} \, \dot{q}^i(t)$
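
The explicit solution for fixed $q$ can be checked numerically at a single time. The following is a minimal sketch, assuming a Gaussian Green function as in section 4.2; the function names, the random test data and the values of $\lambda$ and $\sigma$ are hypothetical and serve only the illustration.

```python
import numpy as np

def gaussian_kernel(q, sigma=1.0):
    """Matrix S with entries f(q_i, q_k) = exp(-|q_i - q_k|^2 / (2 sigma^2))."""
    diff = q[:, None, :] - q[None, :, :]
    return np.exp(-np.sum(diff ** 2, axis=-1) / (2.0 * sigma ** 2))

def optimal_coefficients(q, q_dot, lam, sigma=1.0):
    """Solve (S + I/lam) a = q_dot; each column of a is one spatial coordinate."""
    S_lam = gaussian_kernel(q, sigma) + np.eye(len(q)) / lam
    return np.linalg.solve(S_lam, q_dot)

def pointwise_energy(a, q, q_dot, lam, sigma=1.0):
    """Integrand of (7) at one time: sum <a_k, a_l> f(q_k, q_l) + lam * sum |q_dot_k - (S a)_k|^2."""
    S = gaussian_kernel(q, sigma)
    return float(np.sum(a * (S @ a)) + lam * np.sum((q_dot - S @ a) ** 2))

# Hypothetical test data: 5 landmarks in the plane with arbitrary velocities.
rng = np.random.default_rng(0)
q = rng.normal(size=(5, 2))
q_dot = rng.normal(size=(5, 2))
lam = 10.0
a_star = optimal_coefficients(q, q_dot, lam)
```

Setting the gradient of the integrand with respect to $a$ to zero gives $a = \lambda(\dot{q} - S a)$, which is the linear system solved above; since $S$ is positive semi-definite, this stationary point is the minimizer.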

Minimizing in $q$ with fixed $a$ is not explicit, but the gradient of $E$ in $q$ 
can be computed. We assume a time discretization of order $T$, and set, for 
$t = 0, \ldots, T - 1$: $D_t u = T(u(t+1) - u(t))$ for a time dependent function $u$. The 
discretized energy takes the form 

    $E(a, q) = \sum_{k,l=1}^N \sum_{t=0}^T \langle a_k(t), a_l(t) \rangle f(q_k(t), q_l(t)) + \lambda \sum_{k=1}^N \sum_{t=0}^{T-1} \left| D_t q_k - \sum_{l=1}^N a_l(t) f(q_l(t), q_k(t)) \right|^2$    (8)

Introduce the notation $z_k(t) = D_t q_k - \sum_{l=1}^N a_l(t) f(q_l(t), q_k(t))$.

The partial derivative of $E$ with respect to $q_k(t)$ is a 2D vector, given, for 
each interior time step $t$, by 

    $\frac{\partial E}{\partial q_k(t)} = 2 \sum_{l=1}^N \langle a_k(t), a_l(t) \rangle \nabla_1 f(q_k(t), q_l(t)) - 2 \lambda T D_{t-1} z_k - 2 \lambda \sum_{l=1}^N \left( \langle a_k(t), z_l(t) \rangle + \langle a_l(t), z_k(t) \rangle \right) \nabla_1 f(q_k(t), q_l(t))$    (9)

where $\nabla_1 f \in \mathbb{R}^2$ denotes the gradient of $f$ with respect to one of its variables 
(it does not matter which one, since $f$ is symmetric). Note that $q(0)$ and $q(T)$ are 
clamped to the boundary conditions.

The optimal landmark trajectories are computed by iterating the following 
steps until convergence (initializing with $a = 0$ and linear trajectories): 

1. Gradient step for $q$: for all interior time steps $t$, subtract from $q_k(t)$ a quantity 
proportional to $\frac{\partial E}{\partial q_k(t)}$.

2. Velocity updating: set $a^i(t) = [S_\lambda(t)]^{-1} \, \dot{q}^i(t)$, $t = 1, \ldots, T - 1$, $i = 1, 2$.
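
The two alternating steps can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation used for the experiments: it assumes a Gaussian Green function, sums both terms of the discretized energy over t = 0, ..., T-1, replaces the analytic gradient by a central finite difference, and adds a backtracking rule on the step size so that the energy cannot increase. All names and parameter values are hypothetical.

```python
import numpy as np

def kernel(x, sigma):
    """Gaussian Green function matrix f(x_i, x_k) (assumed kernel)."""
    d = x[:, None, :] - x[None, :, :]
    return np.exp(-np.sum(d ** 2, axis=-1) / (2.0 * sigma ** 2))

def update_a(q, lam, sigma):
    """Velocity update: a(t) = (S(t) + I/lam)^{-1} D_t q for t = 0..T-1."""
    T, N = q.shape[0] - 1, q.shape[1]
    a = np.zeros((T, N, 2))
    for t in range(T):
        S_lam = kernel(q[t], sigma) + np.eye(N) / lam
        a[t] = np.linalg.solve(S_lam, T * (q[t + 1] - q[t]))
    return a

def energy(a, q, lam, sigma):
    """Discretized energy in the spirit of (8), both sums taken over t = 0..T-1."""
    T = q.shape[0] - 1
    total = 0.0
    for t in range(T):
        S = kernel(q[t], sigma)
        z = T * (q[t + 1] - q[t]) - S @ a[t]
        total += np.sum(a[t] * (S @ a[t])) + lam * np.sum(z ** 2)
    return float(total)

def grad_q(a, q, lam, sigma, eps=1e-6):
    """Finite-difference gradient at interior times; q(0) and q(T) stay clamped."""
    g = np.zeros_like(q)
    for idx in np.ndindex(*q[1:-1].shape):
        t, k, c = idx[0] + 1, idx[1], idx[2]
        qp, qm = q.copy(), q.copy()
        qp[t, k, c] += eps
        qm[t, k, c] -= eps
        g[t, k, c] = (energy(a, qp, lam, sigma) - energy(a, qm, lam, sigma)) / (2 * eps)
    return g

def match_landmarks(q0, q1, T=5, lam=10.0, sigma=1.0, n_iter=20, step=1e-3):
    ts = np.linspace(0.0, 1.0, T + 1)[:, None, None]
    q = (1 - ts) * q0[None] + ts * q1[None]   # linear initial trajectories
    a = np.zeros((T, q0.shape[0], 2))         # a = 0 initially
    history = [energy(a, q, lam, sigma)]
    for _ in range(n_iter):
        g = grad_q(a, q, lam, sigma)
        s = step
        while s > 1e-12 and energy(a, q - s * g, lam, sigma) > history[-1]:
            s *= 0.5                          # backtrack until the energy does not increase
        if s > 1e-12:
            q = q - s * g                     # step 1: gradient step for q
        a = update_a(q, lam, sigma)           # step 2: velocity updating
        history.append(energy(a, q, lam, sigma))
    return q, a, history

# Hypothetical example: four landmarks with small prescribed displacements.
q0 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
q1 = q0 + np.array([[0.3, 0.1], [-0.2, 0.2], [0.1, -0.3], [-0.1, -0.1]])
q, a, history = match_landmarks(q0, q1)
```

The finite-difference gradient keeps the sketch short and avoids committing to a particular reading of the analytic formula; for realistic numbers of landmarks one would use the explicit gradient instead.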



Thin-plate generalization We now show how this can be modified to 
incorporate affine invariance as in the thin-plate case. If $L$ is the Laplacian, we have 
seen in section 2.2 that, for each $t$, $v(t, x)$ is defined up to an affine component 
which may in turn be optimized. This yields an expression 

    $v(t, x) = a(t) x + b(t) + \sum_{k=1}^N a_k(t) f(q_k(t), x)$

for the unknown velocity field, $a(t)$ being a $2 \times 2$ matrix and $b(t) \in \mathbb{R}^2$. Let $\gamma^i(t)$ 
be a column of the $3 \times 2$ matrix $[a(t) \; b(t)]^T$, $i = 1, 2$. When trajectories are fixed, 
the optimal coefficients are given like in section 2.2 by 

    $\gamma^i(t) = \left( Q(t)^T S_\lambda(t)^{-1} Q(t) \right)^{-1} Q(t)^T S_\lambda(t)^{-1} \dot{q}^i(t)$  and  $a^i(t) = S_\lambda(t)^{-1} \left( \dot{q}^i(t) - Q(t) \gamma^i(t) \right)$

where $S_\lambda$ is as above and 

    $Q(t) = \begin{pmatrix} q_1(t)^T & 1 \\ \vdots & \vdots \\ q_N(t)^T & 1 \end{pmatrix}$

The modification of the gradient equation in $q(t)$ is almost straightforward 
and we only give the result in discretized form, letting 

    $z_k(t) = D_t q_k - a(t) q_k(t) - b(t) - \sum_{l=1}^N a_l(t) f(q_l(t), q_k(t))$

so that the minimized energy is 

    $E(a, q) = \sum_{k,l=1}^N \sum_{t=0}^T \langle a_k(t), a_l(t) \rangle f(q_k(t), q_l(t)) + \lambda \sum_{k=1}^N \sum_{t=0}^{T-1} |z_k(t)|^2$    (10)

Fig. 1. Random point displacements: small deformations; upper-left: evenly spaced 
points and random displacements; upper-right: optimal trajectories; lower-left: 
estimated diffeomorphism; lower-right: interpolated displacement field by classical splines.



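We have 

    $\frac{\partial E}{\partial q_k(t)} = 2 \sum_{l=1}^N \langle a_k(t), a_l(t) \rangle \nabla_1 f(q_k(t), q_l(t)) - 2 \lambda T D_{t-1} z_k - 2 \lambda \sum_{l=1}^N \left( \langle a_k(t), z_l(t) \rangle + \langle a_l(t), z_k(t) \rangle \right) \nabla_1 f(q_k(t), q_l(t)) - 2 \lambda \, a(t)^T z_k(t)$    (11)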



Note that we wrote everything for 2D matching, but that the formulas 
obviously generalize to any number of dimensions. 

Fig. 2. Random point displacements: average deformations; upper-left: evenly spaced 
points and random displacements; upper-right: optimal trajectories; lower-left: 
estimated diffeomorphism; lower-right: interpolated displacement field by classical splines.



4.2 Experiments 

In the proposed experiments, random displacements are attributed to points 
evenly distributed on a grid. Interpolated deformation and landmark 
trajectories are computed. This is compared to classical spline interpolation (which just 
corresponds to $T = 2$ in the previous algorithm). The Green kernel $f$ is 
Gaussian: $f(x, y) = \exp(-|x - y|^2 / 2\sigma^2)$.

Three experiments are presented. The first one generates small displacements, 
the last one very large ones. Progressively, one sees the singularities generated 
by classical interpolating splines increase, foldings being created, while geodesic 
splines remain one-to-one, and of rather impressive smoothness. The estimated 
landmark trajectories start to bend in the second experiment, and are clearly 
curved in the last one. 

Fig. 3. Random point displacements: large deformations; upper-left: evenly spaced 
points and random displacements; upper-right: optimal trajectories; lower-left: 
estimated diffeomorphism; lower-right: interpolated displacement field by classical splines. 

Abstract. By exploiting an analogy with averaging procedures in fluid 
dynamics, we present a set of averaged template matching equations. 
These equations are analogs of the exact template matching equations 
that retain all the geometric properties associated with the 
diffeomorphism group, and which are expected to average out small scale features 
and so should, as in hydrodynamics, be more computationally efficient 
for resolving the larger scale features. From a geometric point of view, 
the new equations may be viewed as coming from a change in norm that 
is used to measure the distance between images. The results in this 
paper represent first steps in a longer term program: what is here is only 
for binary images and an algorithm for numerical computation is not 
yet operational. Some suggestions for further steps to develop the results 
given in this paper are offered.



1 Introduction 

1.1 Previous Work 

Deformable template matching is a technique for comparing images with applica- 
tions in computer vision, medical imaging and other fields. It has been reported 
on extensively in the literature. See for example, Younes (19), Trouve (17, 18), 
Grenander and Miller (5) and the references therein. 

Template matching is based on the notion of computing a deformation in- 
duced distance between two images. The “energy” required to do a deformation 
that takes one image to the other defines the distance between them. The defor- 
mations are often taken to be diffeomorphisms of the image rectangle, i.e smooth 
maps with smooth inverse. The energy can be defined using various metrics on 
the space of diffeomorphisms. In addition to diffeomorphisms, which are merely 
a change of coordinates of the underlying image rectangle, one can also allow 
changes to the pixel values. Trouve (17, 18) develops such a theory and gives 
several numerical examples. He gives conditions on the metric that are suffi- 
cient to make the space of deformations a complete metric space. He works with 
a subgroup of homeomorphisms as the space of deformations and allows pixel 
value changes by using a semidirect product with a group that acts on the pixel 
values. The paper by Dupuis, Grenander and Miller (3) also derives conditions 
for existence of template matching solutions. 

Recently a partial differential equation for template matching was derived 
by Ratnanather, Baigent, Mumford and Miller (15), for both exact and 
inexact matching. In exact matching the two images being compared have to be 
diffeomorphic and in inexact matching they need not be. Their derivation was 
done using Euler-Poincare reduction theory and also using classical calculus of 
variations. For an early version, which does not use the Euler-Poincare theory, 
see Mumford (14). For Euler-Poincare reduction theory, see Marsden and Ratiu 
(8, Chapters 1 and 13). This technique is useful for computing Euler-Lagrange 
equations when the Lagrangian is invariant under the action of some Lie group. 
For example, it is possible to do a variational derivation of the Euler equations 
of rigid bodies and fluid mechanics using Euler-Poincare reduction theory. 

In their most general form as given in Mumford (14), the exact template 
matching equations (TME) depend on the choice of a self-adjoint operator that 
appears in the definition of the metric on the group of diffeomorphisms. When 
this metric is $L^2$ we will refer to the equations as $L^2$-TME.



1.2 Contributions 

In this paper we derive the isotropic averaged template matching equations 
($H^1_\alpha$-ATME), which we hope will be a version of the exact template matching 
equations that average out small scale features, yet retain the larger scale 
features. The $H^1_\alpha$ refers to a weighted Sobolev $H^1$ metric that we use instead of the 
$L^2$ metric on the group of diffeomorphisms. 

Thus the $H^1_\alpha$-ATME are derived by making a special choice for the self-adjoint 
operator that appears in the derivation of Ratnanather, Baigent, Mumford and 
Miller (15). We expect that the averaged equations and even their anisotropic 
counterparts may also be of interest in computer vision. These might allow 
template matching while ignoring features smaller than a chosen size ($\alpha$ in equation 
(13)). The $H^1_\alpha$-ATME derivation was inspired by recent work on Lagrangian 
averaged equations in fluid mechanics as described in Marsden and Shkoller (10) 
and references therein. 

By analogy with fluid mechanics the $H^1_\alpha$-ATME may be much more amenable 
to numerical solution than the $L^2$-TME. Finally, by allowing the ignorable 
feature size to vary it may be possible to perform template matching more robustly 
using a multiscale approach.



1.3 Overview 

We first set up the framework of template matching. For this paper, the main 
task of template matching reduces to defining distance between binary images. 
After describing the framework we give the definitions and facts that are needed 
for our derivation. These preliminaries include a brief summary of Euler-Poincare 
reduction in Section 3.3. Before giving a derivation of $H^1_\alpha$-ATME we repeat the 
derivation of the TME of Mumford (14) for the special case of the $L^2$ metric. 
The main result in this paper is the derivation of the isotropic averaged 
template matching equations for the exact matching case and this derivation is in 
Section 6. 

2 Our Framework 

The basic element of template matching is the computation of a distance 
between two images and the computation of a deformation that takes one image to 
the other. For concreteness and simplicity we will limit our attention to binary 
images, i.e. an image will be a characteristic function on some bounded open 
subset $M$ of $\mathbb{R}^n$, $n > 1$. Thus our space of images is $\mathcal{P} = \{f \mid f : M \to \{0, 1\}\}$. 
We do the derivations for a general $n$ since those are just as easy as derivations 
for the case $n = 2$. For $n = 2$, $M$ will typically be a rectangle in the plane. 

In order to define a metric on $\mathcal{P}$, given two images $f$ and $g$ one finds the 
“smallest” map $\varphi : M \to M$ such that $f = g \circ \varphi$. In Section 3.2 we show that 
this smallest map $\varphi$ induces a pseudometric on $\mathcal{P}$. This approach does not allow 
one to modify the range of the images; $\varphi$ is just a change of coordinates for $M$. 
A more general framework that allows one to modify the range of the images 
(i.e. modify the pixel values) by using semidirect products is described in Trouve 
(17). 

In order to define the “smallest” map $\varphi$ mentioned above, one must define 
a metric on the space to which $\varphi$ belongs. In addition one has a choice of what 
space to use as the source of the maps $\varphi$. These choices and an analysis of their 
implications require extensive and subtle analysis that uses many mathematical 
tools. A nice discussion and use of these subtleties is in Trouve (17). In this 
paper, to keep things simple, we will ignore these subtleties. We will take the 
space to which the maps $\varphi$ belong to be the space of all diffeomorphisms of $M$ 
fixing its boundary pointwise and denote this space by Diff(M).



3 Preliminaries 

3.1 Facts about Diff(M) 

We will require some basic facts about Diff(M) and its tangent spaces which we 
state here, some without proof. We will ignore the difficulties associated with 
defining a differentiable structure on Diff(M). For details on this, see Ebin and 
Marsden (4). See also Marsden and Ratiu (8) for an elementary discussion. The 
most important fact is that a vector in the tangent space of Diff(M) is a vector 
field on M. This is stated more formally in Facts 1 and 2 below. 

Fact 1. The tangent space of Diff(M) at the point $\varphi \in$ Diff(M) is the space of 
all material (i.e. Lagrangian) velocity vector fields $V_\varphi$ over $\varphi$ on $M$ that vanish 
on the boundary $\partial M$ of $M$. This tangent space is denoted by $T_\varphi(\mathrm{Diff}(M))$. 

Fact 2. The tangent space of Diff(M) at identity $e \in$ Diff(M), i.e. $T_e(\mathrm{Diff}(M))$, 
is the space $\mathfrak{X}(M)$ of all spatial (i.e. Eulerian) velocity vector fields on $M$ that 
vanish on the boundary $\partial M$ of $M$.

We also need some facts about the tangent map (derivative) of the action 
of Diff(M) acting on itself. Let Diff(M) act on itself on the right by function 
composition. Thus for $\varphi, \eta \in$ Diff(M) the right action of $\eta$ on $\varphi$ is 
$\varphi \cdot \eta = R_\eta(\varphi) = \varphi \circ \eta$, where $R_\eta$ denotes right multiplication of the argument by $\eta$. 
Now we compute the tangent lifted action, i.e. $TR_\eta$, which is the derivative of the 
right action described above. We show that 

Fact 3. If $R_\eta$ is the right action defined above, then its derivative is 

    $T_\varphi R_\eta(V_\varphi) = V_\varphi \circ \eta$

for all $V_\varphi \in T_\varphi(\mathrm{Diff}(M))$. This is called the right action by $\eta$ on $V_\varphi$. 

We recall the proof of this standard fact for the reader’s convenience. 

Proof. The proof essentially consists of checking definitions. Let $\varphi_t \subset$ Diff(M) 
be a smooth curve such that $\varphi_0 = \varphi$ and $d/dt|_{t=0} \, \varphi_t = V_\varphi$. Thus for every $X \in M$, 
$d/dt|_{t=0} \, \varphi(X, t) = V_\varphi(X)$. Then by definition of derivative 
$T_\varphi R_\eta(V_\varphi) = d/dt|_{t=0} \, (\varphi_t \circ \eta)$. Thus, $T_\varphi R_\eta(V_\varphi)(X) = d/dt|_{t=0} \, \varphi_t(\eta(X)) = V_\varphi(\eta(X))$. □ 

Definition 1. An inner product on Diff(M) is said to be right invariant under 
action by Diff(M) if $\langle V_\varphi, U_\varphi \rangle = \langle T_\varphi R_\eta(V_\varphi), T_\varphi R_\eta(U_\varphi) \rangle = \langle V_\varphi \circ \eta, U_\varphi \circ \eta \rangle$ 
for all $\varphi, \eta \in$ Diff(M) and $V_\varphi, U_\varphi \in T_\varphi(\mathrm{Diff}(M))$. Here $\langle \cdot, \cdot \rangle$ denotes an inner 
product on Diff(M).

3.2 Pseudometric Using Diffeomorphisms 

We induce a pseudometric (Abraham, Marsden and Ratiu (1)) on the space of 
images $\mathcal{P}$ from a metric defined on Diff(M) as follows. 

Definition 2. The positive valued function $d_\mathcal{P} : \mathcal{P} \times \mathcal{P} \to \mathbb{R}$ is called a 
pseudometric induced on $\mathcal{P}$ from Diff(M), if for any $f, g \in \mathcal{P}$ 

    $d_\mathcal{P}(f, g) = \inf\{ d(e, \varphi) \mid \varphi \in \mathrm{Diff}(M) \text{ and } f = g \circ \varphi \}$, 

where $e$ is the identity diffeomorphism, $d(e, \varphi)$ is the geodesic distance between 
$e$ and $\varphi$ and inf stands for infimum, or greatest lower bound.

As usual, the geodesic distance on Diff(M) is defined in terms of the inner 
product on the tangent spaces of Diff(M), i.e. the Riemannian metric on Diff(M). 
One must prove that $d_\mathcal{P}$ as defined above is actually a pseudometric, i.e. that it 
satisfies the symmetry and triangle inequality properties as well as the property 
that $d_\mathcal{P}(f, f) = 0$ for all $f \in \mathcal{P}$. This is stated in the following Fact 4. See Miller 
and Younes (11) for a sketch of a proof or Hirani, Marsden and Arvo (6), which 
is the technical report version of the present paper, for a more detailed proof. 

Fact 4. If the Riemannian metric on Diff(M) is right invariant under action by 
Diff(M), then the function $d_\mathcal{P}$ of Definition 2 satisfies the pseudometric axioms, 
namely that 

1. $d_\mathcal{P}(f, f) = 0$ for all $f \in \mathcal{P}$; 
2. $d_\mathcal{P}(f, g) = d_\mathcal{P}(g, f)$ for all $f, g \in \mathcal{P}$ (symmetry); and 
3. $d_\mathcal{P}(f, h) \leq d_\mathcal{P}(f, g) + d_\mathcal{P}(g, h)$ for all $f, g, h \in \mathcal{P}$ (triangle inequality). 

Furthermore, if the infimum in Definition 2 is achieved, then $d_\mathcal{P}$ is a metric on 
$\mathcal{P}$, in which case property 1 becomes 

1. $d_\mathcal{P}(f, g) = 0 \iff f = g$ (definiteness). 

One can also show the same result when right invariance is replaced by left 
invariance or bi-invariance.

Thus to compute $d_\mathcal{P}(f, g)$ we need to find the smallest diffeomorphism $\varphi$ 
in Diff(M) such that $f = g \circ \varphi$. The definition of smallest depends on the 
chosen metric on Diff(M). The typical strategy for this is to find a geodesic on 
Diff(M), from the identity map to the unknown diffeomorphism $\varphi$. The unknown 
diffeomorphism satisfies the constraint $f = g \circ \varphi$. Any such $\varphi$ will be the smallest, 
since it will be the diffeomorphism closest to identity, which also satisfies 
$f = g \circ \varphi$. 

There may or may not be many such smallest diffeomorphisms, but it is 
sufficient to find one, in order to solve the template matching problem. The lack 
of uniqueness, if present, may have practical implications for numerical solvers, 
which is an issue we have not yet addressed. Moreover, there may not exist 
any such $\varphi$; for example this is the case when the image corresponding to $f$ is 
not homeomorphic to that corresponding to $g$. The existence issue also requires 
further investigation. Trouve (17, 18) gives conditions that a metric must satisfy 
for existence and uniqueness of minimizers in inexact matching.



3.3 Euler-Poincare Reduction 

We now recall some facts about the Euler-Poincare reduction theorem that we will 
need. For details, see Chapter 13 of Marsden and Ratiu (8). Euler-Poincare 
reduction is useful in mechanics. For example, it is possible to do a variational 
derivation of the Euler equations of rigid bodies and fluid mechanics using 
Euler-Poincare reduction theory. Consider a Lagrangian $L$, i.e. a map $L : T(\mathrm{Diff}(M)) \to \mathbb{R}$, 
so $L$ is a function of $\varphi \in$ Diff(M) and $\dot\varphi \in T_\varphi(\mathrm{Diff}(M))$. If this Lagrangian 
is invariant under right action by Diff(M) then we can use the Euler-Poincare 
reduction theorem (Marsden and Ratiu (8) Theorem 13.5.3). According to this 
theorem, the following two statements are equivalent: 

1. The variational principle $\delta \int_a^b L(\varphi(X, t), \dot\varphi(X, t)) \, dt = 0$ holds for variations 
of curves $\varphi(X, t)$ with fixed end points, i.e. for $\delta\varphi(X, a) = \delta\varphi(X, b) = 0$ 
$\forall X \in M$; 

2. The variational principle $\delta \int_a^b l(u(x, t)) \, dt = 0$ holds on $\mathfrak{X}(M)$, i.e. on the 
tangent space at the identity of Diff(M), using variations of the form 
$\delta u = \dot{w} + [w, u]_L$. This is called the reduced variational principle. 

Here $\dot\varphi(X, t) = d/dt \, \varphi(X, t)$ (keeping $X$ fixed). The vector $u$ is the tangent 
vector $\dot\varphi$ moved to identity $e \in$ Diff(M) by right action by $\varphi_t^{-1}$, i.e. $u = \dot\varphi \circ \varphi_t^{-1}$. 
Subscript $t$ here denotes fixed time $t$, not derivative w.r.t. $t$. More precisely, 
if $x = \varphi(X, t) = \varphi_t(X)$ then $u(x, t) = \dot\varphi(\varphi_t^{-1}(x), t)$. By this we mean that 
first the time derivative of $\varphi(X, t)$ is computed keeping $X$ fixed and then one 
substitutes $X = X(x, t) = \varphi_t^{-1}(x)$ into the resulting expression. The function 
$l : T_e(\mathrm{Diff}(M)) \to \mathbb{R}$ is simply the restriction of $L$ to the tangent space at 
identity $e$. 

The vector $w$ is the vector $\delta\varphi$ moved to identity in Diff(M). Thus $w = \delta\varphi_t \circ \varphi_t^{-1}$. 
The notation $[w, u]_L = (u \cdot \nabla) w - (w \cdot \nabla) u$ is the Jacobi Lie bracket. Note 
that $w(x, a) = w(x, b) = 0$ $\forall x \in M$ since $\delta\varphi(X, a) = \delta\varphi(X, b) = 0$ $\forall X \in M$.



3.4 Gauss-Green Theorem and Its Corollary 

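We will need some basic facts from vector calculus, for the derivation of the 
$L^2$-TME and $H^1_\alpha$-ATME. We state these facts here. 

Fact 5. (Gauss-Green Theorem) Let $M$ be an open bounded subset of $\mathbb{R}^n$ 
and suppose that the boundary $\partial M$ is $C^1$. Suppose $u \in C^1(\bar{M})$. Then 

    $\int_M \frac{\partial u}{\partial x_i} \, dx = \int_{\partial M} u \, \nu^i \, dS$

for all $i = 1, \ldots, n$ and where $\nu = (\nu^1, \ldots, \nu^n)$ is the outward pointing unit 
normal field on $\partial M$. 

A simple corollary of the Gauss-Green theorem is the following fact which we 
will use several times. 

Fact 6. Let $M$ be an open bounded subset of $\mathbb{R}^n$ and suppose that $\partial M$ is $C^1$. 
Let $u, v, w$ be vector fields on $M$. Then 

    $\int_M \operatorname{div} v \, \langle u, w \rangle + \langle u, (v \cdot \nabla) w \rangle + \langle (v \cdot \nabla) u, w \rangle \, dx = \int_{\partial M} \langle u, w \rangle \langle v, \nu \rangle \, dS$    (1)

where $\nu = (\nu^1, \ldots, \nu^n)$ is the outward pointing unit normal field of $\partial M$. 

Proof. We will use the Gauss-Green theorem (Fact 5) stated above. By this 
theorem we have that for all $i, j$ in $\{1, \ldots, n\}$, 
$\int_M \partial/\partial x_j (u^i w^i v^j) \, dx = \int_{\partial M} u^i w^i v^j \nu^j \, dS$. Thus 

    $\sum_{i,j=1}^n \int_M \left( u^i w^i \frac{\partial v^j}{\partial x_j} + u^i v^j \frac{\partial w^i}{\partial x_j} + w^i v^j \frac{\partial u^i}{\partial x_j} \right) dx = \sum_{i,j=1}^n \int_{\partial M} u^i w^i v^j \nu^j \, dS$

which proves equation (1). □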



4 Metrics on Diff(M) 

We need to define two different Riemannian metrics on Diff(M), i.e. inner 
products on its tangent spaces. One is for the $L^2$-TME derivation and the other one 
is for the $H^1_\alpha$-ATME derivation. 

Definition 3. Let $V_\varphi, U_\varphi \in T_\varphi(\mathrm{Diff}(M))$, i.e. $V_\varphi, U_\varphi$ are tangent vectors, 
tangent to Diff(M) at the point $\varphi \in$ Diff(M). Let $J(\varphi)(X)$ be the determinant of 
the derivative of (i.e. the Jacobian determinant of) $\varphi$ evaluated at point $X \in M$. 
The $L^2$ Riemannian metric on Diff(M) we will use for the $L^2$-TME derivation 
is defined as 

    $\langle V_\varphi, U_\varphi \rangle_{L^2} = \int_M \langle V_\varphi(X), U_\varphi(X) \rangle J(\varphi)(X) \, dX$.    (2)

Here $\langle \cdot, \cdot \rangle$ is the standard inner product on $\mathbb{R}^n$.

The $H^1_\alpha$ metric on Diff(M) is defined by first defining it on the tangent space at 
identity and then extending it to all of Diff(M) right invariantly. 

Definition 4. Let $v, u$ be vectors in the tangent space at identity $e \in$ Diff(M), 
i.e. $v, u \in T_e(\mathrm{Diff}(M))$. For any $\alpha > 0$, $\alpha \in \mathbb{R}$, define 

    $\langle v, u \rangle_{H^1_\alpha} = \int_M \langle v(x), u(x) \rangle + \alpha^2 \sum_{i=1}^n \langle D_i v(x), D_i u(x) \rangle \, dx$    (3)

where $D_i v = \partial/\partial x_i (v(x))$. The inner products inside the integral are the 
standard inner products (dot products) on $\mathbb{R}^n$.
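
To make the weighting by $\alpha$ concrete, the inner product (3) can be approximated on a grid. The sketch below is an illustration only, for the one-dimensional case $M = (0, 1)$; the function names, the grid size and the test field are assumptions, and the derivative is approximated by finite differences.

```python
import numpy as np

def trapezoid(y, x):
    """Composite trapezoid rule on a (possibly non-uniform) grid."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def h1_alpha_inner(v, u, x, alpha):
    """1D discretization of (3): integral of v*u + alpha^2 * v' * u' over the grid x."""
    dv = np.gradient(v, x, edge_order=2)
    du = np.gradient(u, x, edge_order=2)
    return trapezoid(v * u + alpha ** 2 * dv * du, x)

# Hypothetical test field vanishing on the boundary of M = (0, 1).
x = np.linspace(0.0, 1.0, 2001)
v = np.sin(np.pi * x)
alpha = 0.1
value = h1_alpha_inner(v, v, x, alpha)
```

For this field the exact value is $1/2 + \alpha^2 \pi^2 / 2$, so the derivative term contributes little when $\alpha$ is small and dominates for rapidly oscillating fields, which is the sense in which $\alpha$ sets an ignorable feature size.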

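To compute the $H^1_\alpha$ inner product at a point $\varphi \in$ Diff(M) different from identity, 
define $\langle V_\varphi, U_\varphi \rangle_{H^1_\alpha} = \langle V_\varphi \circ \varphi^{-1}, U_\varphi \circ \varphi^{-1} \rangle_{H^1_\alpha}$. Note that $V_\varphi \circ \varphi^{-1}, U_\varphi \circ \varphi^{-1} \in T_e(\mathrm{Diff}(M))$ 
because right action by $\varphi^{-1}$ moves the vectors at $\varphi$ to identity 
on Diff(M). Thus we can use (3) to compute the inner product. Note that we 
defined the $L^2$ inner product at a general point of Diff(M) but defined the 
$H^1_\alpha$ inner product at identity and showed how it can be computed at a general 
point. This is done for simplicity. The expression for the $L^2$ inner product at a 
general point is simple but not so for the $H^1_\alpha$ inner product. We will write the 
corresponding norms as follows. The $L^2$ norm of $U_\varphi \in T_\varphi(\mathrm{Diff}(M))$ is written 
as $\|U_\varphi\|_{L^2} = \langle U_\varphi, U_\varphi \rangle_{L^2}^{1/2}$ and similarly for the $H^1_\alpha$ norm.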

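The geodesic distance between $\varphi_a$ and $\varphi_b$ on Diff(M) is defined as 

    $d(\varphi_a, \varphi_b) = \inf \left\{ \int_a^b \langle \dot\varphi(t), \dot\varphi(t) \rangle^{1/2} \, dt \mid \varphi(a) = \varphi_a, \varphi(b) = \varphi_b \right\}$

and the infimum is taken over all smooth parametric curves $\varphi : [a, b] \to$ Diff(M) 
from $\varphi_a$ to $\varphi_b$. Here $\langle \cdot, \cdot \rangle$ is any Riemannian metric on Diff(M). Thus 
$d(\varphi_a, \varphi_b) = \int_a^b \langle \dot\varphi(t), \dot\varphi(t) \rangle^{1/2} \, dt$, where $\varphi$ is the curve between the endpoints that makes 
$\int_a^b \langle \dot\varphi(t), \dot\varphi(t) \rangle^{1/2} \, dt$ stationary. 

The same curve makes the functional $\int_a^b \langle \dot\varphi(t), \dot\varphi(t) \rangle \, dt$ stationary. This fact 
is sometimes stated as “minimizing length is the same as minimizing kinetic 
energy”. One way to prove this fact is by computing the Euler-Lagrange equations 
for both the integrals and noting that the equations are the same. 

4.1 Right Invariance of $L^2$ Metric 

We now prove the right invariance property of the $L^2$ metric in Definition 3. 
This property is crucial for application of the Euler-Poincare reduction theorem.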
This property is crucial for application of the Euler-Poincare reduction theorem. 



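Claim 1. The $L^2$ metric defined by (2) is right invariant under action of the group 
Diff(M) acting on Diff(M), i.e. for any $\eta \in$ Diff(M) 

    $\langle T_\varphi R_\eta(V_\varphi), T_\varphi R_\eta(U_\varphi) \rangle_{L^2} = \langle V_\varphi, U_\varphi \rangle_{L^2}$. 

Proof. By the computation of the tangent lifted group action in Fact 3, this is 
equivalent to showing that $\langle V_\varphi \circ \eta, U_\varphi \circ \eta \rangle_{L^2} = \langle V_\varphi, U_\varphi \rangle_{L^2}$. The left hand side 
above is equal to $\int_M \langle V_\varphi(\eta(X)), U_\varphi(\eta(X)) \rangle J(\varphi \circ \eta)(X) \, dX$. Note that the 
argument $\varphi \circ \eta$ of $J$ is the base point of $T_\varphi R_\eta(V_\varphi)$ and $T_\varphi R_\eta(U_\varphi)$ as required 
by the definition of the $L^2$ metric. Using the chain rule for $J(\varphi \circ \eta)(X)$ the above 
integral becomes $\int_M \langle V_\varphi(\eta(X)), U_\varphi(\eta(X)) \rangle J(\varphi)(\eta(X)) J(\eta)(X) \, dX$. Now use 
the change of variable $Y = \eta(X)$. By the change of variables theorem 
$dY = J(\eta)(X) \, dX$ and the above integral becomes $\int_{\eta(M)} \langle V_\varphi(Y), U_\varphi(Y) \rangle J(\varphi)(Y) \, dY$. 
Since $\eta(M) = M$ the above is equal to $\langle V_\varphi, U_\varphi \rangle_{L^2}$ as desired. □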
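
The change of variables in this proof can be checked numerically in one dimension. The following sketch is an illustration under stated assumptions: $M = (0, 1)$, the Jacobian determinant reduces to the derivative, and the particular maps and fields below (named `phi`, `eta`, `V`, `U`) are hypothetical choices that fix the boundary or vanish on it.

```python
import numpy as np

def trapezoid(y, x):
    """Composite trapezoid rule on the grid x."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

# Hypothetical diffeomorphisms of [0, 1] fixing the endpoints, with their derivatives.
def phi(s):  return s + 0.05 * np.sin(np.pi * s)
def dphi(s): return 1.0 + 0.05 * np.pi * np.cos(np.pi * s)
def eta(s):  return s + 0.1 * np.sin(2.0 * np.pi * s)
def deta(s): return 1.0 + 0.2 * np.pi * np.cos(2.0 * np.pi * s)

# Hypothetical tangent vectors at phi (functions of X vanishing on the boundary).
def V(s): return np.sin(np.pi * s)
def U(s): return s * (1.0 - s)

X = np.linspace(0.0, 1.0, 20001)

# Left hand side: <V o eta, U o eta> at the base point phi o eta,
# with J(phi o eta)(X) = phi'(eta(X)) * eta'(X) by the chain rule.
lhs = trapezoid(V(eta(X)) * U(eta(X)) * dphi(eta(X)) * deta(X), X)

# Right hand side: <V, U> at the base point phi, with J(phi)(X) = phi'(X).
rhs = trapezoid(V(X) * U(X) * dphi(X), X)
```

The two quadratures agree up to discretization error, as the substitution $Y = \eta(X)$ predicts. Dropping the Jacobian factor from both sides breaks the agreement, which is the point made in the discussion of why the determinant appears in (2).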

We now give the intuition behind the form of the $L^2$ inner product. 
Specifically we address the question of why the Jacobian determinant term appears 
in the $L^2$ inner product definition (Definition 3). It appears so that the inner 
product can be made right invariant. Right invariance of the metric on Diff(M) 
implies that the induced function $d_\mathcal{P}$ is a pseudometric (Fact 4). Thus right 
invariance (or left- or bi-invariance for that matter) is a convenient assumption. 
Moreover, the distance between two images should not change if they are both 
distorted by the same change of variables. This also makes the requirement of 
invariance attractive.



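4.2 Right Invariance of $H^1_\alpha$ Metric 

Since the $H^1_\alpha$ metric was defined at identity and extended in a right invariant 
fashion, the check for right invariance is easy. For completeness we give it below. 

Claim 2. The $H^1_\alpha$ metric defined by (3) is right invariant under action of the 
group Diff(M) acting on Diff(M), i.e. for any $\eta \in$ Diff(M) 

    $\langle T_\varphi R_\eta(V_\varphi), T_\varphi R_\eta(U_\varphi) \rangle_{H^1_\alpha} = \langle V_\varphi, U_\varphi \rangle_{H^1_\alpha}$. 

Proof. By Fact 3, it is enough to show that $\langle V_\varphi \circ \eta, U_\varphi \circ \eta \rangle_{H^1_\alpha} = \langle V_\varphi, U_\varphi \rangle_{H^1_\alpha}$ 
for all $\eta \in$ Diff(M). But by definition of the $H^1_\alpha$ inner product at a general point, 

    $\langle V_\varphi \circ \eta, U_\varphi \circ \eta \rangle_{H^1_\alpha} = \langle V_\varphi \circ \eta \circ (\varphi \circ \eta)^{-1}, U_\varphi \circ \eta \circ (\varphi \circ \eta)^{-1} \rangle_{H^1_\alpha}$
    $= \langle V_\varphi \circ \varphi^{-1}, U_\varphi \circ \varphi^{-1} \rangle_{H^1_\alpha} = \langle V_\varphi, U_\varphi \rangle_{H^1_\alpha}$. □

5 Template Matching Equations 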

Both the exact and inexact TME, using Euler-Poincare reduction theory, and 
also using classical calculus of variations, were derived recently, and 
communicated to us by Ratnanather, Baigent, Mumford and Miller (15). An early version 
without the use of Euler-Poincare theory appears in Mumford (14). We should 
note that although Trouve (17) does not mention Euler-Poincare reduction, and 
does not give a PDE for template matching explicitly, he was certainly aware 
of, and used, the idea of moving back and forth between the tangent space at 
identity and a general point of Diff(M). For completeness, we now give the 
Euler-Poincare derivation of the $L^2$-TME in our notation. We do the derivation 
of the exact equations, i.e. it is assumed that the two images being compared 
are diffeomorphic.

5.1 Derivation of Exact $L^2$-TME 

We have seen in Claim 1 that the $L^2$ metric of Definition 3 is right invariant. 
Thus if we define a Lagrangian $L : T(\mathrm{Diff}(M)) \to \mathbb{R}$ as 

    $L(\varphi, \dot\varphi) = \frac{1}{2} \langle \dot\varphi, \dot\varphi \rangle_{L^2} = \frac{1}{2} \|\dot\varphi\|_{L^2}^2$    (4)

then it will also be right invariant under the action of Diff(M). 

Let $\varphi : [0, 1] \to$ Diff(M) be a smooth parameterized curve in Diff(M) 
between the points $e = \varphi(0)$ (identity map) and $\varphi(1) \in$ Diff(M). The point $\varphi(1)$ is 
such that $f = g \circ (\varphi(1))$ for the given images $f$ and $g$. Take a smooth family of 
curves $\varphi_\epsilon$ with the same end points $\varphi(0)$ and $\varphi(1)$ and such that $\varphi_0 = \varphi$. Define 
the variations of the curve $\varphi$ to be the vector field $\delta\varphi = d/d\epsilon|_{\epsilon=0} \, \varphi_\epsilon$ along $\varphi$. 

Consider the variational principle $\delta \int_0^1 L(\varphi(t), \dot\varphi(t)) \, dt = 0$, with the above 
variations. By the discussion in Section 3.2 the solution of the variational 
principle above is a geodesic on Diff(M) under the $L^2$ metric from the identity map 
$e \in$ Diff(M) to the diffeomorphism $\varphi(1)$ which satisfies the condition of 
matching, i.e. $f = g \circ (\varphi(1))$. Due to the right invariance of the Lagrangian we can use 
the Euler-Poincare reduction theorem (see Theorem 13.5.3, Page 437 of Marsden 
and Ratiu (8) and Section 3.3 of this paper). By applying this theorem, we will 
get a variational principle on $\mathfrak{X}(M)$ (called the reduced variational principle) 
and hence a differential equation in terms of Eulerian velocity vector fields on 
$M$. These are the exact $L^2$-TME derived by Ratnanather, Baigent, Mumford 
and Miller (15).

The reduced variational principle uses the reduced Lagrangian $l$ which is 
a function on the tangent space at identity of Diff(M), namely on $\mathfrak{X}(M)$, the 
space of all spatial or Eulerian velocity vector fields on $M$. As noted in Section 
3.3 the function $l$ is just the restriction of $L$ to $T_e(\mathrm{Diff}(M))$. Furthermore, $L$ 
as defined in equation (4) is right invariant under right action of Diff(M). As a 
result $L(\varphi, \dot\varphi) = L(e, \dot\varphi \circ \varphi^{-1}) = l(u)$ where $u = \dot\varphi \circ \varphi^{-1}$. Thus by Definition 3 

    $l(u) = \frac{1}{2} \int_M \|\dot\varphi(X, t)\|^2 J(\varphi_t)(X) \, dX$, 

where the overdot is time derivative keeping $X$ fixed and the norm $\| \cdot \|$ is the 
standard norm in $\mathbb{R}^n$. Let $X = \varphi_t^{-1}(x)$. Then $J(\varphi_t)(X) \, dX = dx$. Thus, 

    $l(u) = \frac{1}{2} \int_M \|\dot\varphi(\varphi_t^{-1}(x), t)\|^2 \, dx = \frac{1}{2} \int_M \|u(x, t)\|^2 \, dx = \frac{1}{2} \|u\|_{L^2}^2$.

Let us call the functional for the reduced variational principle $\mathcal{S}$ where

$$\mathcal{S}(u) = \int_0^1 l(u(t))\, dt = \frac{1}{2} \int_0^1 \|u(x,t)\|^2_{L^2}\, dt .$$

Then $\delta\mathcal{S} = \int_0^1 \int_M \langle u(x,t), \delta u(x,t) \rangle\, dx\, dt$, where the inner product inside the
integral is the usual dot product in $\mathbb{R}^n$. Inserting the definition of $\delta u$ from Section
3.3 in this integral and setting the resulting expression to 0 we get

$$\int_0^1 \int_M \langle u(x,t), \dot w(x,t) + [w,u]_L(x,t) \rangle\, dx\, dt = 0 .$$

Now substitute the definition of $[w,u]_L$ from Section 3.3, or page 20 of Marsden
and Ratiu (8). With this the above equation becomes

$$\int_0^1 \int_M \langle u, \dot w \rangle - \langle u, (w \cdot \nabla) u \rangle + \langle u, (u \cdot \nabla) w \rangle\, dx\, dt = 0 . \tag{5}$$



Integration by parts on the time variable for the first term, together with the
fact that $w(x,0) = w(x,1) = 0\ \ \forall x \in M$ (see Section 3.3), implies that

$$\int_0^1 \int_M \langle u(x,t), \dot w(x,t) \rangle\, dx\, dt = - \int_0^1 \int_M \langle \dot u(x,t), w(x,t) \rangle\, dx\, dt . \tag{6}$$



For the second and third terms of equation (5) also, the goal is to rewrite them
in the form $\langle \cdot, w \rangle$. To bring the second term into the required form, note that
$(w \cdot \nabla) u = D u \cdot w$ where $D$ denotes the spatial derivative. Thus

$$\langle u, (w \cdot \nabla) u \rangle = \langle u, D u \cdot w \rangle = \langle (D u)^T \cdot u, w \rangle . \tag{7}$$



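For the third term, we use Fact 6. From Fact 2, $u|_{\partial M} = 0$ and so the RHS of
equation (1) is 0. Thus

$$\int_M \langle u, (u \cdot \nabla) w \rangle\, dx = - \int_M \left( \langle u, w \rangle\, \operatorname{div} u + \langle (u \cdot \nabla) u, w \rangle \right) dx . \tag{8}$$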



Using equations (6), (7) and (8) in equation (5), and since $w$ is arbitrary, it follows that

$$\frac{\partial u}{\partial t} + (D u)^T \cdot u + (\operatorname{div} u)\, u + (u \cdot \nabla) u = 0 . \tag{9}$$

The above equation (9) is the template matching equation, for exact matching,
i.e. the $L^2$-TME, as communicated to us by Ratnanather, Baigent, Mumford,
Miller (15). We have just repeated the derivation in our notation for completeness.
Note that for all $(x,t) \in M \times [0,1]$, $(D u(x,t))^T \cdot u(x,t) = \frac{1}{2} \nabla(\|u(x,t)\|^2)$
where now, the norm on the RHS is the standard norm in $\mathbb{R}^n$. With this, the
$L^2$-TME, equation (9), can be written in an alternative form as

$$\frac{\partial u}{\partial t} + (\operatorname{div} u)\, u + (u \cdot \nabla) u = -\frac{1}{2} \nabla(\|u\|^2) \tag{10}$$

where $u = u(x,t)$ is the unknown time dependent spatial (Eulerian) velocity
vector field on M that vanishes on the boundary $\partial M$ of M.
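The identity $(D u)^T \cdot u = \frac{1}{2}\nabla(\|u\|^2)$ used to pass from (9) to (10) can be checked numerically. The following is a minimal sketch, assuming a hypothetical smooth test field `u` on the plane and central finite differences; the field, the step size `h` and the function names are illustrative choices, not part of the original derivation.

```python
import math

def u(x, y):
    """Hypothetical smooth test vector field on the plane (illustration only)."""
    return (math.sin(x) * math.cos(y), x * y + math.cos(x))

def jacobian(f, x, y, h=1e-5):
    """Central-difference Jacobian D f; entry [i][j] is d f_i / d x_j."""
    fxp, fxm = f(x + h, y), f(x - h, y)
    fyp, fym = f(x, y + h), f(x, y - h)
    return [[(fxp[i] - fxm[i]) / (2 * h), (fyp[i] - fym[i]) / (2 * h)]
            for i in range(2)]

def du_transpose_u(x, y):
    """Left-hand side: (D u)^T . u, i.e. component j is sum_i (d u_i / d x_j) u_i."""
    J = jacobian(u, x, y)
    ux = u(x, y)
    return [sum(J[i][j] * ux[i] for i in range(2)) for j in range(2)]

def half_grad_norm_sq(x, y, h=1e-5):
    """Right-hand side: (1/2) grad(|u|^2), by central differences."""
    def half_sq(a, b):
        v = u(a, b)
        return 0.5 * (v[0] ** 2 + v[1] ** 2)
    return [(half_sq(x + h, y) - half_sq(x - h, y)) / (2 * h),
            (half_sq(x, y + h) - half_sq(x, y - h)) / (2 * h)]

lhs = du_transpose_u(0.3, -1.2)
rhs = half_grad_norm_sq(0.3, -1.2)
print(lhs, rhs)  # the two vectors agree up to finite-difference error
```

Both sides are computed independently, so agreement at several sample points is a quick sanity check on the sign and transpose conventions in (9) and (10).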

These equations can be written more concisely, in the Lie derivative form as
$\partial\beta/\partial t + \pounds_u \beta = 0$ where $\beta$ is the one form density associated with $u$ and where $\pounds$
is the Lie derivative. In $\mathbb{R}^n$, $\beta = u^\flat \otimes d^n x$. These are the Euler-Poincaré
equations associated with the right invariant $L^2$ metric of Definition 3 on the
diffeomorphism group. The advantage of this form is that it can accommodate
other metrics, simply by changing $\beta$. This will become clear when we derive the
$H^1_\alpha$-ATME in Section 6.1.



6 Averaged Template Matching Equations 

We now derive the $H^1_\alpha$-ATME, which are a set of averaged template matching
equations. These equations are analogs of the exact template matching equations that
retain all the geometric properties associated with the diffeomorphism group and
which are expected to average out small scale features and so should, as in
hydrodynamics, be more computationally efficient for resolving the larger scale
features. From a geometric point of view, the new equations may be viewed as
coming from a change in norm that is used to measure the distance between
images.

6.1 Derivation of $H^1_\alpha$-ATME

The steps in deriving the $H^1_\alpha$-ATME are almost identical to those used for
$L^2$-TME, except that we use the $H^1_\alpha$ metric (Definition 4) on Diff(M), instead of
the $L^2$ metric. Thus we start with a Lagrangian $L : T(\mathrm{Diff}(M)) \to \mathbb{R}$ defined as

$$L(\varphi(t), \dot\varphi(t)) = \tfrac{1}{2} \langle \dot\varphi(t), \dot\varphi(t) \rangle_{H^1_\alpha} . \tag{11}$$

The initial and final value conditions are the same as in the $L^2$-TME case, i.e.
$f = g \circ (\varphi(1))$. By Claim 2 this Lagrangian is invariant under the right action of
Diff(M). Thus, as in Section 5.1, Euler-Poincaré reduction can be applied but
with a different norm. The reduced Lagrangian is $l(u) = \frac{1}{2}\|u\|^2_{H^1_\alpha}$. By Definition
4 of the $H^1_\alpha$ norm, this implies that
$l(u) = \frac{1}{2} \int_M \langle u, u \rangle + \alpha^2 \sum_{i=1}^n \langle D_i u, D_i u \rangle\, dx$.
The functional that appears in the reduced variational principle is

$$\mathcal{S}(u) = \int_0^1 l(u)\, dt = \frac{1}{2} \int_0^1 \int_M \langle u, u \rangle + \alpha^2 \sum_{i=1}^n \langle D_i u, D_i u \rangle\, dx\, dt .$$
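To illustrate how the extra gradient term in the reduced Lagrangian penalizes small scale features, here is a minimal one-dimensional sketch. It assumes $M = [0,1]$, the hypothetical test fields $u_k(x) = \sin(k\pi x)$ (which vanish on the boundary), and a midpoint quadrature; for these fields the exact value is $\frac{1}{4}(1 + \alpha^2 k^2 \pi^2)$. The function name and the parameter values are illustrative assumptions.

```python
import math

def reduced_lagrangian_h1(u, du, alpha, n=2000):
    """Midpoint-rule approximation of 0.5 * int_0^1 (u^2 + alpha^2 * u_x^2) dx."""
    h = 1.0 / n
    total = 0.0
    for k in range(n):
        x = (k + 0.5) * h
        total += u(x) ** 2 + alpha ** 2 * du(x) ** 2
    return 0.5 * total * h

def mode(k):
    """Test field sin(k*pi*x) and its derivative; both are illustrative choices."""
    return (lambda x: math.sin(k * math.pi * x),
            lambda x: k * math.pi * math.cos(k * math.pi * x))

alpha = 0.5
u1, du1 = mode(1)
u3, du3 = mode(3)
l2_value = reduced_lagrangian_h1(u1, du1, 0.0)     # alpha = 0 recovers the L^2 case
h1_coarse = reduced_lagrangian_h1(u1, du1, alpha)  # low-frequency field
h1_fine = reduced_lagrangian_h1(u3, du3, alpha)    # higher-frequency field costs more
print(l2_value, h1_coarse, h1_fine)
```

With $\alpha = 0$ the two fields have the same cost, while for $\alpha > 0$ the finer field is penalized more strongly, which is the sense in which $\alpha$ acts as a length scale.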

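Let $u_\epsilon$ be a one parameter family of spatial vector fields on M, depending
smoothly on $\epsilon$ such that $u_0 = u$ and as usual $\delta u = d u_\epsilon / d\epsilon |_{\epsilon=0}$. The
variations $\delta\mathcal{S}(u)$ are given by

$$\delta\mathcal{S}(u) = \int_0^1 \left\langle \frac{\delta l}{\delta u}, \delta u \right\rangle dt . \tag{12}$$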

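We now compute the above expression. The integrand is

$$\left\langle \frac{\delta l}{\delta u}, \delta u \right\rangle = \left.\frac{d}{d\epsilon}\right|_{\epsilon=0} l(u_\epsilon) = \int_M \langle u, \delta u \rangle + \alpha^2 \sum_{i=1}^n \left.\left\langle \frac{\partial u_\epsilon}{\partial x^i}, \frac{\partial^2 u_\epsilon}{\partial\epsilon\, \partial x^i} \right\rangle\right|_{\epsilon=0} dx .$$

But

$$\sum_{i=1}^n \left\langle \frac{\partial u_\epsilon}{\partial x^i}, \frac{\partial^2 u_\epsilon}{\partial\epsilon\, \partial x^i} \right\rangle = \sum_{i=1}^n \sum_{j=1}^n \frac{\partial u_\epsilon^j}{\partial x^i} \frac{\partial^2 u_\epsilon^j}{\partial\epsilon\, \partial x^i} = \sum_{j=1}^n \left\langle \frac{\partial u_\epsilon^j}{\partial x}, \frac{\partial^2 u_\epsilon^j}{\partial x\, \partial\epsilon} \right\rangle .$$

Now

$$\int_M \left.\left\langle \frac{\partial u_\epsilon^j}{\partial x}, \frac{\partial^2 u_\epsilon^j}{\partial x\, \partial\epsilon} \right\rangle\right|_{\epsilon=0} dx = \int_M \left\langle \frac{\partial u^j}{\partial x}, \frac{\partial (\delta u^j)}{\partial x} \right\rangle dx .$$

Then using integration by parts,

$$\int_M \left\langle \frac{\partial u^j}{\partial x}, \frac{\partial (\delta u^j)}{\partial x} \right\rangle dx = \int_M \sum_{i=1}^n \frac{\partial u^j}{\partial x^i} \frac{\partial (\delta u^j)}{\partial x^i}\, dx = - \int_M \sum_{i=1}^n \frac{\partial^2 u^j}{\partial (x^i)^2}\, \delta u^j\, dx + \int_{\partial M} \sum_{i=1}^n \frac{\partial u^j}{\partial x^i}\, \delta u^j\, \nu^i\, dS$$

where $\nu = (\nu^1, \dots, \nu^n)$ is the outward pointing normal field on the boundary
$\partial M$ of M. But by Fact 2, $u|_{\partial M} = 0$ so we get that

$$\int_M \left\langle \frac{\partial u^j}{\partial x}, \frac{\partial (\delta u^j)}{\partial x} \right\rangle dx = - \int_M \sum_{i=1}^n \frac{\partial^2 u^j}{\partial (x^i)^2}\, \delta u^j\, dx = - \int_M (\Delta u^j)\, \delta u^j\, dx .$$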
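Thus

$$\left\langle \frac{\delta l}{\delta u}, \delta u \right\rangle = \int_M \langle u, \delta u \rangle - \alpha^2 \sum_{j=1}^n (\Delta u^j)\, \delta u^j\, dx .$$

Substituting this into equation (12) we get that

$$\delta\mathcal{S}(u) = \int_0^1 \int_M \langle u, \delta u \rangle - \alpha^2 \sum_{j=1}^n (\Delta u^j)\, \delta u^j\, dx\, dt = \int_0^1 \int_M \langle u - \alpha^2 \Delta u, \delta u \rangle\, dx\, dt .$$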

Now substituting the expression for $\delta u$ from Section 3.3 we get

$$\delta\mathcal{S}(u) = \int_0^1 \int_M \langle u - \alpha^2 \Delta u,\ \dot w - (w \cdot \nabla) u + (u \cdot \nabla) w \rangle\, dx\, dt .$$



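We move the derivative operators away from $w$, as in the $L^2$-TME derivation in
Section 5.1, to get an integrand of the form $\langle \cdot, w \rangle$. Because of the arbitrariness
of $w$, we get the isotropic averaged $H^1_\alpha$ template matching equations, i.e. the
$H^1_\alpha$-ATME as

$$\frac{\partial}{\partial t} u - \alpha^2 \frac{\partial}{\partial t} \Delta u + u (\operatorname{div} u) - \alpha^2 (\operatorname{div} u) \Delta u + (u \cdot \nabla) u - \alpha^2 (u \cdot \nabla) \Delta u + (D u)^T \cdot u - \alpha^2 (D u)^T \cdot \Delta u = 0 , \tag{13}$$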



where $\Delta$ is the componentwise Laplacian and $D u$ is the spatial derivative of
$u$. Here $u = u(x,t)$ is the unknown time dependent spatial (Eulerian) velocity
vector field on M which vanishes on the boundary $\partial M$ of M. To make it easier
to see the relationship between the structure of the $H^1_\alpha$-ATME and $L^2$-TME we
define $v = (1 - \alpha^2 \Delta) u$. Then the $H^1_\alpha$-ATME equation (13) becomes

$$\frac{\partial v}{\partial t} + (D u)^T \cdot v + (\operatorname{div} u)\, v + (u \cdot \nabla) v = 0 . \tag{14}$$

The Lie derivative form of these equations is $\partial\beta/\partial t + \pounds_u \beta = 0$ where now $\beta$ is
the one form density associated with $v = (1 - \alpha^2 \Delta) u$.



7 Connections with Fluid Mechanics 

We now give the analogy and connections with fluid mechanics which have
inspired our present work. We start with the connection between TME and fluid
mechanics. The $L^2$-TME in $n$ spatial dimensions, namely equation (9) or
equivalently (10), is a higher dimensional analogue of the inviscid Burgers' equation.
By this we mean that in one spatial dimension the $L^2$-TME reduce to the
equation $u_t + 3 u u_x = 0$ where the subscripts indicate derivatives and $u = u(x,t)$ is
the velocity of the fluid at location $x$ at time $t$. This fact is mentioned in
Mumford (13). Burgers' equation, at least in the initial value formulation, can develop
shocks. The framework for $L^2$-TME sets this equation in higher dimensions, and
as an initial-final value problem. It is possible that shocks may develop in this
formulation also. The well-posedness and possibility of shocks in $L^2$-TME remain
to be checked. However, as in the case of Burgers' equation and incompressible
fluid mechanics, they are well-posed for short time evolution; see Marsden, Ratiu
and Shkoller (9) and Marsden and Shkoller (10) and references therein.
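The shock formation mentioned above for the initial value formulation can be made concrete with the method of characteristics. The sketch below assumes the one-dimensional equation $u_t + 3 u u_x = 0$ with smooth initial data $u_0$: characteristics are the lines $x = x_0 + 3 u_0(x_0) t$, and they first cross at $t^* = -1/(3 \min u_0')$ when $u_0'$ is negative somewhere. The initial profile $u_0 = \sin x$ and the sampling grid are illustrative assumptions, and the sketch says nothing about the initial-final value problem of template matching.

```python
import math

def breaking_time(du0, xs, speed_factor=3.0):
    """Earliest crossing time of the characteristics x0 + speed_factor*u0(x0)*t.

    du0 is the derivative of the initial profile, sampled at the points xs.
    Returns math.inf when the profile is nowhere decreasing (no crossing).
    """
    steepest = min(du0(x) for x in xs)
    if steepest >= 0:
        return math.inf
    return -1.0 / (speed_factor * steepest)

# Illustrative initial profile u0(x) = sin(x), so u0'(x) = cos(x).
xs = [2 * math.pi * k / 1000 for k in range(1000)]
t_star = breaking_time(math.cos, xs)
print(t_star)  # close to 1/3 for this profile
```

The factor 3 in front of $u u_x$ shortens the time to breaking by a factor of three relative to the standard Burgers normalization $u_t + u u_x = 0$.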

Our $H^1_\alpha$-ATME, on the other hand, contain diffusive terms which may
ameliorate some of the analytical problems of $L^2$-TME, and may be a reason to
expect more stable numerics. In one spatial dimension our $H^1_\alpha$-ATME, equations
(13), reduce to the shallow water equations $u_t - u_{xxt} = -3 u u_x + 2 u_x u_{xx} + u u_{xxx}$
or equivalently $v_t + u v_x + 2 v u_x = 0$ where $v = u - u_{xx}$ (here $\alpha$ of $H^1_\alpha$-ATME has
been set to unity). These equations are completely integrable and have peaked
solitons, as was shown by Camassa and Holm (2). The shallow water equations,
like Burgers' equation, also have a smooth spray, as was shown by Shkoller (16)
and hence one has local existence and uniqueness of geodesics. The $L^2$-TME are
equations of geodesics on Diff(M) under the right invariant $L^2$ metric whereas
the $H^1_\alpha$-ATME are equations of geodesics under a right invariant $H^1_\alpha$ metric.
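The equivalence of the two one-dimensional forms quoted above (with $\alpha = 1$) is a purely algebraic identity, so it can be checked on any smooth function, not only on solutions. The following sketch evaluates both residuals for a hypothetical test function built from two travelling sine modes with analytically known derivatives; the mode parameters are arbitrary illustrative values.

```python
import math

# (wavenumber a, frequency b, amplitude): arbitrary test values, not from the paper.
MODES = [(1.0, 0.7, 1.0), (2.0, -0.3, 0.5)]

def derivs(x, t):
    """Return u and the needed partial derivatives for u = sum amp*sin(a*x + b*t)."""
    d = dict(u=0.0, ux=0.0, uxx=0.0, uxxx=0.0, ut=0.0, uxxt=0.0)
    for a, b, amp in MODES:
        s, c = math.sin(a * x + b * t), math.cos(a * x + b * t)
        d["u"] += amp * s
        d["ux"] += amp * a * c
        d["uxx"] += -amp * a ** 2 * s
        d["uxxx"] += -amp * a ** 3 * c
        d["ut"] += amp * b * c
        d["uxxt"] += -amp * a ** 2 * b * c
    return d

def residual_v_form(x, t):
    """v_t + u v_x + 2 v u_x with v = u - u_xx."""
    d = derivs(x, t)
    v = d["u"] - d["uxx"]
    v_t = d["ut"] - d["uxxt"]
    v_x = d["ux"] - d["uxxx"]
    return v_t + d["u"] * v_x + 2.0 * v * d["ux"]

def residual_u_form(x, t):
    """(u_t - u_xxt) - (-3 u u_x + 2 u_x u_xx + u u_xxx)."""
    d = derivs(x, t)
    rhs = -3.0 * d["u"] * d["ux"] + 2.0 * d["ux"] * d["uxx"] + d["u"] * d["uxxx"]
    return (d["ut"] - d["uxxt"]) - rhs

print(residual_v_form(0.4, 1.3), residual_u_form(0.4, 1.3))  # equal up to rounding
```

The test function is not a solution, so the residuals are nonzero; the point is only that they coincide, which confirms that the two forms describe the same equation.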

Recently, averaged Euler and Navier-Stokes equations for fluid mechanics 
have been developed. See for example Holm, Marsden and Ratiu (7) and Mars- 
den, Ratiu and Shkoller (9). Besides having nice analytical properties, prelim- 
inary numerical experiments (see Mohseni et al (12) and references therein) of 
these averaged Euler and Navier-Stokes equations show the possible advantages 
of an averaging approach in the context of numerical solution of nonlinear equa- 
tions of fluid mechanics. 

There is however a very important difference between the template matching 
framework and the usual fluid mechanics framework. In template matching the 
equations have to be solved as an initial-final value problem. Thus one is given 
image $f$ and it has to be deformed to image $g$ by moving the pixels in such a
way that the motion of the image during the deformation satisfies the template 
matching equations. In fluid mechanics one is typically interested in giving some 
initial velocity and studying how the particles move. 



8 Discussion and Future Work 

We have named the $H^1_\alpha$ equations averaged equations. This is with the
expectation that they will have averaging properties like the averaged equations of
hydrodynamics described in Marsden and Shkoller (10). But we have not yet
done an averaging derivation in the template matching context to see if the
solution does indeed allow one to compare images while ignoring features smaller
than $\alpha$. This is the most important task that remains to be done. Furthermore,
a condition like $f = g \circ (\varphi(1))$ will have to be modified in the averaging case
possibly by preprocessing $f$ and $g$.
It is our expectation that there will be considerable improvement in the
numerical performance of the $H^1_\alpha$-ATME. This expectation is based on the
analogous situation in fluid dynamics in which averaging improves numerical
simulation as demonstrated in Mohseni et al. (12) and references therein. We also plan
to investigate generalizations of the procedures here using inexact matching and 
semidirect product theory (used for inhomogeneous fluid problems, for exam- 
ple), to generalize beyond binary images. We also want to explore the role of the 
rotation group and reduction constructions. This is based on the fact that the 
notion of shape space is a technique common to both mechanics and computer 
vision. 

Due to the initial-final value formulation, it appears that an optimization
approach is one way to solve template matching type equations, as is done in
Trouvé (17, 18). Thus when it comes to computations, the PDEs of template
matching (9) or (13) may or may not be used directly for some applications.
In the optimization approach, one goes back to the Lagrangian (material)
formulation on the group and finds geodesics directly, by a gradient descent type
algorithm for example.

One benefit of the PDE formulation is theoretical: one now knows what
equation is being solved. But more importantly, we emphasize that the very
reason that we were able to arrive at an averaged version of $L^2$-TME was because
we knew how averaging worked in hydrodynamics by changing the metric on the
group Diff(M), and we knew how to go back and forth between an arbitrary
point on the group and identity using Euler-Poincaré theory. Finally, there may
be other problems within computer vision or outside it, in which the PDEs are
used directly in the computations.

As for the existence of a minimizer of the energy functional, Trouvé (17, 18)
uses a different definition of distance on Diff(M). He gives a sufficient condition
on the metric on Diff(M) for the variational problem of template matching
to have a minimizer. Roughly speaking, this condition (which is part of what
he calls the admissibility criteria) is that the metric must be as strong as the
$C^1$ metric. By the Sobolev embedding theorem this implies that an $H^s$ metric will
satisfy this condition iff $s > n/2 + 1$. Thus for $n = 2$ or 3, an $L^2$ or $H^1$
metric, including the $H^1_\alpha$ metric, will not satisfy this condition. However, note that
Trouvé's admissibility criteria are only a sufficient condition. In practice in template
matching sometimes the metric $\int_M \langle u, u \rangle + (\Delta u)^2\, dx$ is used as mentioned in
Grenander and Miller (5). This metric also does not satisfy the admissibility
criteria. This suggests that more work on the theory of existence of minimizers
for template matching functionals is needed.

Finally we note that we have only shown that dp is a pseudometric. If the
infimum in Definition 2 is not achieved, dp is only a pseudometric and there can
exist $f$ and $g$, $f \neq g$, but dp($f$, $g$) = 0. What do specific examples (if any) of
such $f$, $g$ pairs look like? Even more interestingly one can ask if for some metrics
on Diff(M), for example for the $H^1_\alpha$ metric of Definition 4, dp is a metric. We
plan to investigate these questions.
Abstract. We propose a continuous description for 2-D shapes that measures
convexity and symmetry and is able to account for size. Convexity and size are known
to be critical in deciding figure/ground (F/G) separation, with the study initiated
by the Gestalt school [9] [11]. However, few quantitative discussions were made
before. Thus, we emphasize the convexity/size measurement for the purpose of
F/G prediction. A Kullback-Leibler measure is introduced. In addition, the
symmetry information is studied through the same platform. All these shape properties
are collected for shape representations. Overall, our representations are given in
a continuous manner. For convexity measurement, unlike the 1/0 mathematical 
definition where shapes are categorized as convex or concave, we give a measure 
describing shapes as “more” or “less” convex than others. In symmetry informa- 
tion (skeleton) retrieval, a 2-D intensity map is provided with the intensity value 
specifying “strength” of the skeleton. The proposed representations are robust in
the sense that small fine-scale perturbations on shape boundaries will cause minor
effects on the final representations. All these shape properties are integrated into
one description. To apply to the F/G separation, the shape measure can be flexibly
chosen between a size-invariant convexity measure or a convexity measure with
the small size preference. The model is established on an orientation diffusion
framework, where the local features, serving as inputs, are intensity edge locations
and their orientations. The approach is a variational one, rooted in a Markov
random field (MRF) formulation. A quadratic form is used to assure simplicity and
the existence of a solution.

Key Words: Early Vision, Shape Analysis, Convexity, Symmetry, Orientation 
Diffusion. 



1 Introduction 

The convexity criterion has been argued to be important in F/G separation (Kanizsa [9]
or Fig. 1), where a 2-D region can be recognized as figure due to the convexity property
of its borders. However, the measurement for convexity was left as intuitive and no
formalism was then offered. The mathematical definition of convexity¹ only allows for
two classes of shapes: convex or concave ones (not convex). There is an intuition that some
shapes are “more convex” than others (Fig. 1 (a1) (a2), Fig. 6 (a1)-(a3)). To capture this
intuition, we study a continuous measure of convexity.

¹ A shape is convex if given any pair of interior points, the segment connecting these points is
completely inside the shape.
Fig. 1. The preference of convexity and size. In (a1)/(a2), white/black regions are reported as
figures respectively, which argues that convexity plays an important role in such a decision. In (b1),
white strips are likely to be perceived in front of a black background, which gives an example of
(small) size preference.



First, the human visual system tends to ignore minor perturbations (protrusions) of
shapes. Such robustness is therefore a desired feature for a model measuring convexity.
By a measure with this consideration, figural regions can be selected. We will argue
that this F/G selection cannot be provided by a well-known continuous criterion, called
shape compactness, defined by the ratio of the square of perimeter to the shape area [1].
A related topic is to study a size-invariant measure. Under this measure, shapes can be
compared to each other, for shapes with or without similar size. However, because the
human visual system prefers small size objects to be perceived as figures (Koffka [11],
Fig. 1(b1) (b2)), our complete model of shapes should also be able to measure convexity
integrated with a preference of small size shapes. Our model will provide a parameter
that can be tuned to a size-invariant convexity measure or one that biases towards small
size shapes.
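For reference, the shape compactness criterion mentioned above (squared perimeter divided by area) is easy to evaluate for simple shapes. The sketch below is only an illustration of that definition: the function name and the example shapes are assumptions, and it does not reproduce the argument about F/G selection.

```python
import math

def compactness(perimeter, area):
    """Shape compactness: squared perimeter divided by area (dimensionless)."""
    return perimeter ** 2 / area

# Illustrative shapes with closed-form perimeter and area.
circle_small = compactness(2 * math.pi * 1.0, math.pi * 1.0 ** 2)
circle_large = compactness(2 * math.pi * 5.0, math.pi * 5.0 ** 2)
square = compactness(4 * 2.0, 2.0 ** 2)
rectangle = compactness(2 * (4.0 + 1.0), 4.0 * 1.0)

print(circle_small, circle_large, square, rectangle)
```

The ratio is scale invariant (both circles give $4\pi$), and the circle attains the smallest value among these shapes, so compactness measures closeness to a disc rather than convexity as such: the square and the rectangle are both convex yet receive different values.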

We propose an orientation diffusion process defined in the space $\mathbb{R}^2 \times S^1$. In our view,
the elements/features detected in images at the first stage are intensity edge locations and
their orientations. It is known that human visual perception is based on orientation filters
(found in areas V1 and V2 of the brain), providing the orientation of the edges detected.
Afterwards, a scheme is designed to propagate the oriented boundary
information through local interactions. The problem of propagation is formulated as a
Markov random field. With our model, the problem of F/G can be studied in a wide range
of imagery.

Besides the convexity measurement, the symmetry information (skeleton) can be 
derived through the computation of divergence of a vector field which is generated from 
the result of orientation diffusion. Overall, our model is able to (i) deliver simultaneously 
a convexity measure and symmetry information of shapes; (ii) adjust (via one parameter) 
the balance between convexity and size, i.e., it is able to bias to a size invariant model 
or to include size in its measure. 



1.1 Previous Work 

Mumford [13] used Elastica in the smooth curve finding. The line propagation was
done via a diffusion process in orientation and a translation in the $(\cos\theta\, \hat{x} + \sin\theta\, \hat{y})$-
direction (with an exponential decay). Therefore, our model is similar to his. The most
important difference is that our consideration is made in $\mathbb{R}^2 \times S^1$ rather than in $\mathbb{R}^2$,
causing all orientations to be considered simultaneously in a minimization formulation.
Also, we are proposing various shape measures, a topic beyond his work.

Another interesting technical work was given by Tang et al. [20]. They were only
concerned with smoothing dense vector fields. A technical difference is that their work
was done in the space of maps with a unit vector for each 2-D point, while we work on
the space $\mathbb{R}^2 \times S^1$. Their direction diffusion was used for image processing and noise
removal. We use a different diffusion process to generate a field based on a local sparse
input. Also, we are interested in obtaining various measures to describe 2-D shapes, by
investigating this field.

Pao et al. [16] have proposed a continuous measure of convexity (let us call it the
PGR model). However, various problems exist with their model; perhaps the most serious
one is that the size of shapes dominates the convexity measure and so small concave
shapes could be preferred over larger circles. Thus, not only did the model not separate
convexity from size, but also the dominance of size might not be desired.

The works on diffusion and shapes by Kimia, Tannenbaum & Zucker [10] and Siddiqi
& Kimia [19] are sources of inspiration. The main difference here is that we are examining
a linear diffusion equation on a more complex space of positions and orientations. For
searching the skeleton of a shape, we compute the “sinkage” of energy in the shape. It
is similar to Dimitrov et al. [4], with the computation of divergence of a vector field,
although our vector field is generated by a different method.

2 Orientation Diffusion 

Given a 2-D region/shape $\Omega$ (boundary and its inside), or its boundary $\partial\Omega$, we discuss its
convexity and other shape properties. The region/shape or its boundary can be complete,
or partial where only part of it is visible through the image frame. We use $\Gamma$ instead
for the boundary when the shape/boundary is partially visible. On the boundary, the
orientation information can be locally represented by unit normal vectors in the inward
directions (Fig. 2(b1)). Given a set of detectors of the boundary and their orientations,
we want to derive a field inside the boundary to describe various shape properties.

We adopt an orientation diffusion / random walk process to solve this problem. The
approach is in part inspired by Kumaran et al. [12], Pao et al. [16] (PGR model or decay
diffusion [15]) where their diffusion process was worked on the image space $I$, and
also inspired by Mumford [13], Tang [20] where they included the orientation space. We
extend the PGR model to augment the space to $I \times \Theta$ by adding an orientation dimension
$\Theta = \{\theta \cdot 2\pi/T : \theta = 0, 1, \dots, T-1\}$.

We will show that our model can provide a more appropriate measure for convex- 
ity than previous works. Moreover, other properties of shapes such as their skeleton 
information will also be obtained. 

2.1 Notations 

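For a given region $\Omega$ (represented by $\Gamma$) in an input image² $I = \{\mathbf{x} = (x, y) \in \{0, \dots, L_1 - 1\} \times \{0, \dots, L_2 - 1\}\}$,
a set of useful information is collected. For an intensity function
$I(\mathbf{x})$, the edge function

$$e(\mathbf{x};\theta) = \frac{1}{M_p} \left| \frac{\partial I(\mathbf{x})}{\partial \mathbf{u}_\theta} \right| \in [0,1], \qquad \mathbf{x} \in \Gamma$$

records the normalized magnitude of the intensity change at pixel $\mathbf{x}$ in the direction of
$\mathbf{u}_\theta = (\cos\theta, \sin\theta)$; namely, the (unsigned) directional derivative in $\mathbf{u}_\theta$, normalized by
$M_p = \max\{ |\partial I(\mathbf{x}) / \partial \mathbf{u}_\theta| : \mathbf{x} \in I, \theta \in \Theta \}$, the largest orientational intensity edge in the image
$I$. Also, on $\Gamma$, a hypothesis set (or inducers) is defined by

$$\sigma_0(\mathbf{x};\theta) = \begin{cases} e(\mathbf{x};\theta) & \text{if } \mathbf{u}_\theta \cdot \mathbf{n}(\mathbf{x}) > 0 \wedge \mathbf{x} \in \Gamma \\ 0 & \text{if } \mathbf{u}_\theta \cdot \mathbf{n}(\mathbf{x}) < 0 \wedge \mathbf{x} \in \Gamma . \end{cases} \tag{1}$$

² We will use bold face, lower-case letters to denote vectors.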



On the boundary, the vector $\mathbf{n}(\mathbf{x})$ is the unit normal vector chosen in the inward direction.
We say that the directions closer to the inward side are in the source mode and the
directions closer to the outward side are in the sink mode. Finally, a field

$$\sigma(\mathbf{x};\theta) : \Omega \times \Theta \to \mathbb{R}$$

will be evaluated by our orientation diffusion process.
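The edge function and the hypothesis set of Eq. 1 can be sketched for a single boundary pixel as follows. This is a minimal illustration under stated assumptions: the intensity gradient `grad`, the inward normal `normal` and the normalizer `m_p` are supplied by hand rather than computed from an image, the orientation set has `T` evenly spaced directions, and the tangential case $\mathbf{u}_\theta \cdot \mathbf{n} = 0$ is treated as sink mode, a choice the text does not specify.

```python
import math

def orientations(T):
    """Discrete orientation set: T evenly spaced angles in [0, 2*pi)."""
    return [2 * math.pi * k / T for k in range(T)]

def edge_strength(grad, theta, m_p):
    """e(x; theta): unsigned directional derivative along u_theta, divided by M_p."""
    return abs(grad[0] * math.cos(theta) + grad[1] * math.sin(theta)) / m_p

def hypothesis(grad, normal, theta, m_p):
    """sigma_0(x; theta) at a boundary pixel.

    Inward-pointing orientations (u_theta . n > 0) carry the edge strength,
    all others carry 0. Treating u_theta . n == 0 as sink is an assumption.
    """
    u_dot_n = normal[0] * math.cos(theta) + normal[1] * math.sin(theta)
    return edge_strength(grad, theta, m_p) if u_dot_n > 0 else 0.0

# Hypothetical boundary pixel: unit gradient aligned with the inward normal.
grad, normal, m_p = (1.0, 0.0), (1.0, 0.0), 1.0
sigma0 = [hypothesis(grad, normal, th, m_p) for th in orientations(8)]
print(sigma0)
```

In this example the inducer is strongest along the inward normal, falls off with the cosine of the deviation, and vanishes for all outward orientations.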



2.2 Variational Model 

We will create an orientation diffusion process working in the field $\{\sigma(\mathbf{x};\theta) ; \mathbf{x} \in \Omega, \theta \in \Theta\}$
by formulating it as a variational problem.

Local Hypotheses and Data Fitting. The field $\sigma(\mathbf{x};\theta)$ should take the local hypothesis
values $\sigma_0(\mathbf{x};\theta)$, where they are available. In our model they are available at all intensity
edge pixels (pixels on $\Gamma$) according to their strength $e(\mathbf{x};\theta)$. I.e., we seek the $\sigma(\mathbf{x};\theta)$
which minimizes

$$E_{\mathrm{data}}(\sigma|\sigma_0) = \sum_{(\mathbf{x};\theta) \in \Gamma \times \Theta} e(\mathbf{x};\theta) \left( \sigma(\mathbf{x};\theta) - \sigma_0(\mathbf{x};\theta) \right)^2 \tag{2}$$

In most cases, the gradient direction is the one with the maximum value of $e(\mathbf{x};\theta)$,
associated with either an inward ($\sigma_0 > 0$) or outward hypothesis ($\sigma_0 = 0$).

For the homogeneous regions (where no intensity edge is present), we extend the
definition of the functions $\sigma_0$ and $e$ by respectively assigning a local hypothesis of value
0 and a small constant strength,

$$\sigma_0(\mathbf{x};\theta) = 0 \quad \text{if } \mathbf{x} \in \Omega \wedge \mathbf{x} \notin \Gamma \tag{3}$$

$$e(\mathbf{x};\theta) = \lambda \quad \text{if } \mathbf{x} \in \Omega \wedge \mathbf{x} \notin \Gamma . \tag{4}$$

The index of the summation in Eq. 2 becomes $(\mathbf{x};\theta) \in \Omega \times \Theta$ now, taking the whole
region. The small constant parameter $0 < \lambda \ll 1$, known as the decay coefficient³,
controls the decay effect of the energy when $\sigma$ is away from the inducers (hypothesis
set). A larger $\lambda$ has a stronger effect in bringing pixels away from the intensity edges to take
the value 0 (see Fig. 3(d) & (e)).

³ When this functional is analyzed as a Markov chain, the parameter $\lambda$ plays the role of decay
parameter, or the probability of “vanishing”, of the random walk associated to this functional.
Smoothness. Propagation of the local hypothesis is done by adding a smoothness term
that prefers neighboring pixels with similar orientations sharing similar values. For
instance, in Fig. 2(a), the vector in position $P_0$ with the orientation $\theta$ and its neighbor $P_1$
with the orientation $\theta + \psi$ are encouraged to share a similar value of $\sigma$. A simple quadratic
form is applied here by minimizing

$$\frac{1}{\kappa} \sum_{\psi = -\kappa}^{\kappa} M(\mathbf{x};\theta,\psi) \left( \sigma(x + \cos\theta, y + \sin\theta, \theta + \psi) - \sigma(\mathbf{x},\theta) \right)^2 , \tag{5}$$

for each $\mathbf{x}$ and $\theta$. The function $M(\mathbf{x};\theta,\psi)$ is simply provided by $\cos(\frac{\pi\psi}{2\kappa})$, giving the
cosine weighting for the smoothness that concentrates particularly on small deviations. A
more complicated version of the function $M$ may also depend on $\mathbf{x}$ (non-homogeneous)
or $\theta$ (anisotropic). The parameter⁴ $\kappa \in (0, \pi)$, called the deviation factor, controls the
deviation of orientations for the moving particles. A continuous version of Eq. 5 is used
here for computational purposes.

$$E_{\mathrm{smooth}}(\sigma) = \int_{\Omega \times \Theta} \left[ \frac{1}{\kappa} \int_{-\kappa}^{\kappa} M(\mathbf{x};\theta,\psi) \left( \cos\theta\, \sigma_x + \sin\theta\, \sigma_y + \psi\, \sigma_\theta \right)^2 d\psi \right] d\mathbf{x}\, d\theta . \tag{6}$$



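Energy Functional. The total energy functional is the summation of the data fitting term
(a continuous form is applied by substituting a 2-D Dirac delta function in Eq. 2 [15])
and the smoothness term

$$E(\sigma|\sigma_0) = E_{\mathrm{data}}(\sigma|\sigma_0) + E_{\mathrm{smooth}}(\sigma) , \tag{7}$$

where the functional $H(\sigma|\mathbf{x};\theta)$ within the bracket in Eq. 6 can be abbreviated as

$$\begin{aligned} H(\sigma|\mathbf{x};\theta) &= (\cos\theta\, \sigma_x + \sin\theta\, \sigma_y)^2 \cdot \frac{1}{\kappa} \int_{-\kappa}^{\kappa} M(\mathbf{x};\theta,\psi)\, d\psi \\ &\quad + 2 \sigma_\theta (\cos\theta\, \sigma_x + \sin\theta\, \sigma_y) \cdot \frac{1}{\kappa} \int_{-\kappa}^{\kappa} \psi\, M(\mathbf{x};\theta,\psi)\, d\psi + \sigma_\theta^2 \cdot \frac{1}{\kappa} \int_{-\kappa}^{\kappa} \psi^2 M(\mathbf{x};\theta,\psi)\, d\psi \\ &= A (\cos\theta\, \sigma_x + \sin\theta\, \sigma_y)^2 + 2 B\, \sigma_\theta (\cos\theta\, \sigma_x + \sin\theta\, \sigma_y) + C\, \sigma_\theta^2 , \end{aligned} \tag{8}$$

where $A = 4/\pi$, $B = 0$ and $C = \frac{4\kappa^2}{\pi} \left( 1 - \frac{8}{\pi^2} \right)$. Its Euler-Lagrange equation can be given by

$$\cos^2\theta\, \sigma_{xx} + 2 \sin\theta \cos\theta\, \sigma_{xy} + \sin^2\theta\, \sigma_{yy} + \kappa^2 \left( 1 - \frac{8}{\pi^2} \right) \sigma_{\theta\theta} = \frac{\lambda\pi}{4}\, \sigma , \tag{9}$$

with the boundary condition $\sigma(\mathbf{x};\theta) = \sigma_0(\mathbf{x};\theta)$, if $(\mathbf{x},\theta) \in \Gamma \times \Theta$.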
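The coefficients in Eq. 8 are weighted moments of the deviation angle and can be checked by direct quadrature. The sketch below assumes the cosine weighting $M = \cos(\pi\psi/(2\kappa))$ and the $1/\kappa$ normalization; both are read off from the stated value $A = 4/\pi$ and from the factor $\kappa^2(1 - 8/\pi^2)$ in Eq. 9, so they should be treated as assumptions about the intended formula. Under them the second moment evaluates to $\frac{4\kappa^2}{\pi}(1 - \frac{8}{\pi^2})$, so that $C/A$ matches the coefficient of $\sigma_{\theta\theta}$ in Eq. 9.

```python
import math

def weighted_moment(power, kappa, n=20000):
    """Midpoint rule for (1/kappa) * int_{-kappa}^{kappa} psi^power * M(psi) dpsi,
    with the assumed weighting M(psi) = cos(pi * psi / (2 * kappa))."""
    h = 2.0 * kappa / n
    total = 0.0
    for k in range(n):
        psi = -kappa + (k + 0.5) * h
        total += psi ** power * math.cos(math.pi * psi / (2.0 * kappa))
    return total * h / kappa

kappa = 0.4  # the deviation factor used in the experiments
A = weighted_moment(0, kappa)
B = weighted_moment(1, kappa)
C = weighted_moment(2, kappa)
print(A, B, C, C / A)
```

The value of $A$ does not depend on $\kappa$, $B$ vanishes by symmetry of the weighting, and $C$ scales with $\kappa^2$, so $\kappa$ only controls the strength of diffusion in orientation relative to translation.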

Numerically, the energy function is quadratic and the minimizer is straightforwardly
obtained by rewriting Eq. 7 in matrix form by the finite difference method, and solving
it by a gradient-descent method. We call this minimizer $\sigma^*(\mathbf{x};\theta)$. Some examples of
$\sigma^*(\mathbf{x};\theta)$ are shown in Fig. 3(a) & (b).
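As a rough illustration of this numerical procedure, the sketch below minimizes a one-dimensional analogue of the energy by plain gradient descent. It is a deliberately simplified toy, not the model on $\Omega \times \Theta$: a single chain of nodes replaces the position-orientation grid, the smoothness term is a squared difference between neighbors, one inducer with strength 1 and hypothesis 1 sits at the first node, and all other nodes carry the decay strength `lam` with hypothesis 0. The step size and iteration count are illustrative choices.

```python
def minimize_chain(e, sigma0, step=0.1, iters=5000):
    """Gradient descent on sum_i e[i]*(s[i]-sigma0[i])^2 + sum_i (s[i+1]-s[i])^2."""
    n = len(e)
    s = [0.0] * n
    for _ in range(iters):
        grad = []
        for i in range(n):
            g = 2.0 * e[i] * (s[i] - sigma0[i])   # data fitting term
            if i > 0:
                g += 2.0 * (s[i] - s[i - 1])      # smoothness with left neighbor
            if i < n - 1:
                g += 2.0 * (s[i] - s[i + 1])      # smoothness with right neighbor
            grad.append(g)
        s = [si - step * gi for si, gi in zip(s, grad)]
    return s

def solve(lam, n_free=20):
    """One inducer at node 0, followed by n_free homogeneous nodes with decay lam."""
    e = [1.0] + [lam] * n_free
    sigma0 = [1.0] + [0.0] * n_free
    return minimize_chain(e, sigma0)

sigma_weak = solve(0.1)    # small decay coefficient: the hypothesis spreads far
sigma_strong = solve(1.0)  # larger decay coefficient: the field dies out quickly
print(sigma_weak[:5], sigma_strong[:5])
```

Because the energy is a strictly convex quadratic, gradient descent with a sufficiently small step converges to the unique minimizer; in practice a linear solver or conjugate gradients would reach the same solution faster.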

⁴ We avoid the degenerate case, $\kappa = 0$, when the smoothness functional becomes
$\sum_\theta M(\mathbf{x},\theta,0) \left( \sigma(\mathbf{x},\theta) - \sigma(x + \cos\theta, y + \sin\theta, \theta) \right)^2$, no longer depending on $\psi$, and the
minimization process can be worked separately through different $\theta$'s.
Fig. 2. (a) A particle moves from $P_0$ to $P_1$ with a slight change in orientation. We say that the
particle has the configuration changed from $(\mathbf{x};\theta)$ to $(x + \cos\theta, y + \sin\theta; \theta + \psi)$ as it goes from
the point $P_0$ to $P_1$ (assume $|P_0 - P_1| = 1$). The shadow area indicates the possible trails that the
particle may choose, with different weighting according to the deviation angle $\psi$. (b1) One part of
the hypothesis set $\sigma_0$ (Eq. 1: part 1) with inward vectors of length $e(\mathbf{x};\theta)$ represented by vector
notations. Only the normal directions are shown. Other parts include (b2) the set of vectors of length
0 with outward direction (Eq. 1: part 2) and (b3) decay vectors generally put all over the space
$(I - \Gamma) \times \Theta$ (Eq. 3).



2.3 Kullback-Leibler Measure 

Once the field $\sigma^*(\mathbf{x};\theta)$ is obtained, we can evaluate those shape properties, such as
convexity or symmetry. First of all, from the solution space $\Omega \times \Theta$ to the image space
$\Omega$, we perform a vector summation considering $\sigma^*(\mathbf{x};\theta)$ as a vector in the direction of
$\mathbf{u}_\theta = (\cos\theta, \sin\theta)$ with length $\sigma^*(\mathbf{x};\theta)$ (always non-negative),

$$\boldsymbol{\sigma}^*(\mathbf{x}) = \sigma^*_{\theta^*}(\mathbf{x}) = \int_0^{2\pi} \sigma^*(\mathbf{x};\theta)\, \mathbf{u}_\theta\, d\theta , \tag{10}$$

where the direction of the resultant vector $\sigma^*_{\theta^*}$ will be recorded by $\theta^*$, if it has non-zero length.
We shall use $\sigma^*$ to denote the resultant vectors if no confusion can be made. One example
of the vector field $\sigma^*(\mathbf{x})$ is in Fig. 3(f2). The idea is that this vector summation will remove
the symmetry information of the solution field, and thus what is left is “convexity”.
For example, the center of a circle is perfectly symmetric and will result in $\sigma^* = 0$
while other internal points will have the symmetry removed and still produce evidence
of convexity. In our approach, convexity works on information complementary to that used by
symmetry. Detecting symmetry will be done by investigating which coordinates
have a vector summation yielding zero value (cancellations), while convexity examines the
non-zero valued information of the vector summation. Indeed, the convexity measure we
propose measures how well the “vector bundle” is accumulated around this summation
vector. Let us be more precise.

550 Hsing-Kuo Pao and Davi Geiger

Fig. 3. Results of the orientation diffusion process in two presentations: (a) $\sigma(x, y, 0)$, (b)
$\sigma(x, y; \theta)$ at a second orientation $\theta$, (c) maximum magnitude among all orientations, and (d) (e) relative entropy
$D(P_{\mathbf{x}} \| Q)$ (Eq. 12) with $\lambda = 0.1/|\Omega| = 2.876 \times 10^{-5}$ and $500/|\Omega| = 0.144$ respectively, giving
$\mathcal{S} = 0.449$ in (d) and $\mathcal{S} = 0.806$ in (e). The difference shows the effect of the energy decay. Larger $\lambda$
gives stronger decay, less diffusion and a larger convexity measure (Eq. 13). (f1) (f2) Vector presentations
for part of the bell shape figure in (f). (f1) The vector bundles with a choice of 8 orientations. We say that
the point A has a more balanced (symmetric) result along different orientations than the bundle in B.
(f2) The convexity vector $\sigma^*$ (Eq. 10).

At a fixed location $\mathbf{x}$, for each pair of opposite orientations $\theta$ and $\theta + \pi$, we compute
the net effect, called the map $\tau^*(\mathbf{x};\theta)$, by

$$\begin{aligned} \tau^*(\mathbf{x};\theta) &= \tfrac{1}{2} \left\{ \sigma^*(\mathbf{x},\theta) - \sigma^*(\mathbf{x},\theta+\pi) + |\sigma^*(\mathbf{x},\theta) - \sigma^*(\mathbf{x},\theta+\pi)| \right\} \\ \tau^*(\mathbf{x};\theta+\pi) &= \tfrac{1}{2} \left\{ \sigma^*(\mathbf{x},\theta+\pi) - \sigma^*(\mathbf{x},\theta) + |\sigma^*(\mathbf{x},\theta+\pi) - \sigma^*(\mathbf{x},\theta)| \right\} \\ &= \tau^*(\mathbf{x};\theta) - \sigma^*(\mathbf{x};\theta) + \sigma^*(\mathbf{x};\theta+\pi) , \end{aligned} \tag{11}$$



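where a cyclic boundary condition $\theta = \theta + 2\pi$ is used. Note that Eq. 10 can be rewritten
as $\boldsymbol{\sigma}^*(\mathbf{x}) = \int_0^{2\pi} \tau^*(\mathbf{x};\theta)\, \mathbf{u}_\theta\, d\theta$. The second step is to transform $\tau^*(\mathbf{x};\theta)$ to $P^*(\mathbf{x};\theta)$ or
simply $P_{\mathbf{x}}(\theta)$, by a linear transformation. It is

$$P^*(\mathbf{x};\theta) = \tfrac{1}{2} \left( 1 + \tau^*(\mathbf{x};\theta) \right) .$$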
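The two steps above (net effect of opposite orientations, then the linear map to $P_{\mathbf{x}}$) can be sketched for one pixel with a discrete orientation bundle. The sample values of `sigma` are hypothetical, the orientation set is assumed to have an even number of evenly spaced directions so that $\theta + \pi$ is again a sample, and the integral of Eq. 10 is replaced by a plain sum (the constant factor $d\theta$ is dropped).

```python
import math

def net_effect(sigma):
    """tau*(theta) = 0.5 * (d + |d|) with d = sigma(theta) - sigma(theta + pi)."""
    T = len(sigma)
    half = T // 2  # index offset corresponding to a rotation by pi (T must be even)
    out = []
    for k in range(T):
        d = sigma[k] - sigma[(k + half) % T]
        out.append(0.5 * (d + abs(d)))
    return out

def vector_sum(values):
    """Discrete analogue of Eq. 10: sum of values[k] * u_theta_k."""
    T = len(values)
    return (sum(v * math.cos(2 * math.pi * k / T) for k, v in enumerate(values)),
            sum(v * math.sin(2 * math.pi * k / T) for k, v in enumerate(values)))

# Hypothetical bundle sigma*(x; theta) at one pixel, 8 orientations.
sigma = [0.9, 0.6, 0.2, 0.1, 0.3, 0.1, 0.0, 0.4]
tau = net_effect(sigma)
P = [0.5 * (1.0 + t) for t in tau]

resultant_sigma = vector_sum(sigma)
resultant_tau = vector_sum(tau)
theta_star = math.atan2(resultant_tau[1], resultant_tau[0])
print(tau, P, theta_star)
```

Replacing each opposite pair by its net effect leaves the resultant vector unchanged, which is why Eq. 10 can be rewritten in terms of $\tau^*$; the map $P_{\mathbf{x}}$ then takes values in $[\frac{1}{2}, 1]$ whenever $\sigma^* \in [0, 1]$.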



Unlike any previous work, we propose a Kullback-Leibler distance (or relative
entropy) to measure convexity. The map $P_{\mathbf{x}}(\theta)$ is compared to a Gaussian map
(not normalized)

$$Q(\theta) = \frac{1}{1 + \epsilon_N} \exp\left( - \frac{(\theta - \theta^*)^2}{2 \cdot (2.0)^2} \right) , \qquad -\pi < \theta < \pi$$

where the center of $Q$ is located in the direction $\theta^*$, the direction of vector $\sigma^*$. A small
positive constant $\epsilon_N$ is added to avoid the case of $Q(\theta) = 1$. The Kullback-Leibler
distance $D(P_{\mathbf{x}} \| Q)$ (Eq. 12) between $Q(\theta)$ and $P_{\mathbf{x}}(\theta)$ is computed at every point $\mathbf{x}$.

Intuitively, it measures the inefficiency of assuming that the distribution is $Q(\theta)$ when
the real distribution is $P_{\mathbf{x}}(\theta)$. That is, we assume the map has a sharp peak at $\theta^*$.
$D(P_{\mathbf{x}} \| Q)$ is always equal to or greater than 0, and it can be greater than 1. A smaller
measure means a better fit.

This measure is given for each point $\mathbf{x}$. To obtain a global measure of a shape, we
simply average this measure over the points on the figure. I.e., the final measure of a
shape $\Omega$ is

$$\mathcal{S}(P_{\mathbf{x}}) = \frac{1}{|\Omega|} \int_\Omega D(P_{\mathbf{x}} \| Q)\, d\mathbf{x} , \tag{13}$$

A shape with a smaller measure is considered more convex than others. A simple example
can be seen in Fig. 4 as an F/G experiment. We use $\mathcal{S}/\bar{\mathcal{S}}$ to denote the measure for the
(conventional) figure/background area respectively.



3 Convexity Measurement 

To examine the orientation model and the Kullback-Leibler measure, we conduct
different sets of experiments. One set of images refers to the F/G separation task, and we
test the ability of the model to account for human visual perception. The second set displays
shape figures, and the task is to order and compare them. For each set of examples we
illustrate the results with several prototypes. The prototype illustrations for the F/G set
are Fig. 4, Fig. 5 and Kanizsa’s figure comparing convexity versus symmetry (Fig. 6).
The prototypes for the second set are a (non-)perfect ellipse and a “bell” shape (shown
on the right-hand side of Fig. 7).

In Sec. 4, we consider another set, studying the behavior of the model as we vary the
parameter $\kappa$. This is particularly important for understanding how the shape size plays a role
in the model.

We are also interested in the convexity comparison of perfectly convex shapes. Shapes
like circles, squares, and triangles are ordered by the Kullback-Leibler measure. Also,
the comparison of squares to rectangles favors squares over rectangles. These results are
shown in Sec. 5.

Fig. 4. A simple experiment used for F/G separation and its relative entropy $D(P_x \| Q)$. While the right
side is convex and the left side is concave, we have S/S = 0.605/0.671, respectively. Both regions
are equally sized.

In all cases we used synthesized binary images. The parameters we used are the
decay coefficient $\lambda = 0.1/|\Omega|$, normalized by the area of the region $\Omega$, and the deviation factor
$\kappa = 0.4$, unless stated otherwise. The peak of the delta function is
simulated by 1 /e with e = 10“® • |9l?|, a constant normalized by the perimeter of Q. The 
normalization of the parameters $\lambda$ or $\epsilon$ according to the area or perimeter of $\Omega$, respectively,
is not essential for the convexity measurement and can be removed.

3.1 Convexity as Salience 

We test our measure on various prototypes. For the simple test in Fig. 4, the measure
is given by S/S = 0.605/0.671, favoring the white region. For another F/G image in
Fig. 5, we obtain consistent predictions across images with different ratios of black/white
area. We also observe the size effect from (a) to (c): the differences between figure
and background become smaller as the black/white regions get smaller/larger, respectively. For
the Kanizsa figure in Fig. 6(a3), where the white regions have the largest area among (a1)-
(a3), the result predicts that the white regions (convexity) prevail as the figure over the
black regions (symmetry), with S/S = 0.490/0.497. This cannot be obtained with the model
of [16], which gives S/S = 0.715/0.645, saying that the black regions will be
the figure, a “wrong” prediction.

The reason is that the decay model provided by Pao et al. [16] is too sensitive to
size and cannot provide consistent predictions when we move the edge boundaries
leftward or rightward, even when the change is small and beyond what our eyes can detect.
A better model, the oriented one, gives correct predictions consistently under shifts of
the edge boundaries.

4 Convexity versus Size 

In the Euler-Lagrange equation in Eq. 9, the deviation factor $\kappa$ serves as the diffusivity
in the orientation space $\Theta$. In the random walk formulation, a larger $\kappa$ provides more
noise and less deterministic behavior in orientation when the walk moves from one location to
a neighboring location. A measure with a small-size preference is obtained in such a
situation. On the other hand, a smaller $\kappa$ gives more control over orientation. The measure
then has less size preference and is closer to an ideal convexity measure than the one given
in the less deterministic case. The most extreme case, obtained by letting $\kappa \to 0$, gives a
size-invariant measure; it is discussed below. Results for different choices
of $\kappa$ are studied in Sec. 4.1 for F/G separation and in Sec. 4.2 for shape convexity
comparison. The choice of $\kappa$ provides the flexibility of choosing a convexity
measure between a size-invariant one and one with a small-size preference. This cannot
be achieved by the decay measure in [16]. We use the Kanizsa images in Fig. 6 (Sec. 4.1)
and the ellipse/bell pair in Fig. 7 (Sec. 4.2) to illustrate our idea.

(a) 0.520/0.587 (a1) (b) 0.522/0.585 (c) 0.525/0.583

Fig. 5. Colonnade with boundaries translated along the horizontal direction from (a), a size-
balanced one, to (c), with white regions having the largest area. The relative entropy of (a) is
shown in (a1). The differences between the figure/ground measures get smaller from (a) to
(c), as the black/white regions get smaller/larger. This shows the size/proximity preference of the model.



Size Invariance by Letting $\kappa \to 0$. Let us analyze the case $\kappa \to 0$, where the walk
is totally deterministic in orientation. By taking the limit $\kappa \to 0$, the functional in Eq. 8
becomes

$$\lim_{\kappa \to 0} H(\sigma | x;\theta) = \frac{4}{\pi} \left( \cos\theta\, \sigma_x + \sin\theta\, \sigma_y \right)^2 .$$

We want to examine the effect of shape scaling in this limit for the whole functional in
Eq. 7. After that, we study the scaling effect on the convexity measure $S$.

Consider two similar shapes $\Omega$ and $\Omega'$ in an image $I$, indicated by two characteristic
functions $\chi$ and $\chi'$ such that $\chi(x) = 1$ or $\chi'(x) = 1$ if $x \in \Omega$ or $\Omega'$, respectively, and $\chi(x) = 0$ or
$\chi'(x) = 0$ otherwise. Under an appropriately chosen coordinate system, we have $\chi(x) =
\chi'(rx) = \chi'(x')$ $\forall x, x' \in I$. Given similar edge maps $e(x;\theta) = e'(x';\theta) = |u_\theta \cdot n(x)|$,
when $\kappa \to 0$ and $\lambda \to 0$, the decay term vanishes and we have

$$E(\sigma') = E(\sigma) = \int_{\Omega \times \Theta} \left[ \delta^2 \cdot e \cdot (\sigma - \sigma_0)^2 + \frac{4}{\pi} \left( \cos\theta\, \sigma_x + \sin\theta\, \sigma_y \right)^2 \right] dx\, d\theta ,$$

if we assign $\sigma'(x';\theta) = \sigma(x;\theta)$. So the minimizer for either functional can be obtained
easily when the other one has been computed. The two minimizers are related by $\sigma^*(x;\theta) =
\sigma'^*(x';\theta)$. Different orientations are treated independently here.

Let us discuss the scaling effect on the convexity measure. For two similar maps
$\sigma^*$, $\sigma'^*$, we obtain the similarity between $P_x$ and $P'_{x'}$, or $D(P_x \| Q)$ and $D(P'_{x'} \| Q)$.
Therefore, for the convexity measure, we have

$$S(P'_{x'}) = \frac{1}{|\Omega'|} \int_{\Omega'} D(P'_{x'} \| Q)\, dx' = S(P_x) .$$

This is an ideal case in which we achieve a measure with no sensitivity to size.
Examples in the discrete case can be found in Fig. 7 and Sec. 4.2. When $\lambda$ is not zero,
we need to set a new $\lambda' = \lambda / r^2$ to achieve the similarity $\sigma^*(x;\theta) = \sigma'^*(x';\theta)$.

4.1 Figure/Ground Separation in Convexity-Symmetry Image and Tuning of k 

We will use the Kanizsa figures in Fig. 6 to illustrate our points. As discussed,
the decay model cannot provide a consistent prediction when the area ratio of different
regions is changed, even though the change is small and beyond what our eyes can
detect. The orientation model gives a consistent prediction under a small shift of the
edge boundaries. The trick lies in the free parameter $\kappa$, which on one hand (small $\kappa$, or
the extreme case $\kappa \to 0$) gives a convexity measure without a small-size
preference and on the other hand (large $\kappa$) gives a measure with the size preference. In
Tab. 1, we have series (a1), with balanced area for black and white regions; series (a2), with larger
white regions; and series (a3), with white regions of the largest area. Several pairs of comparisons
can be studied. The case $\kappa = 0.1$, the purest convexity measure among them, gives the
expected answer with white in front for all three cases. When $\kappa = \pi/2$, the case
with the most size preference among them, the white regions in (a1) and (a2) are
preferred, but not in (a3), where the smaller black regions are preferred. The case
$\kappa = 0.4$ gives an intermediate transition. For perceptual simulation, if
F/G is decided by both convexity and size preferences, we can tune the
parameter $\kappa$ to obtain the appropriate balance between them, according to the preferences
of the human visual system.

(a1) (a2) (a3) (b3) 0.490/0.497 (c3) 0.653/0.576

Fig. 6. Convexity measure for the Kanizsa figures (adapted from Kanizsa [9]). (b3) gives the
relative entropy of (a3) and (c3) gives the decay diffusion field of (a3) (Pao et al. [16]). When the
boundaries are translated from (a1) to (a3), with the white regions getting bigger, the convexity measure
from the orientation diffusion gives S/S = 0.490/0.497 in (b3), for a test with $\kappa = 0.1$, while in
(c3), the measure from the PGR model shows S/S = 0.653/0.576, a “wrong” prediction. Note
that the (non-oriented) decay diffusion process has more diffusion near the neck area, while
in (b3), the oriented process gives a poor relative entropy there. More discussion
related to size and the choice of $\kappa$ can be found in Tab. 1.

4.2 Convexity Comparison of Shapes and Tuning of k 

We study our convexity measure for different shapes with the same or different sizes. As
shown in Fig. 7, the imperfect ellipses A1 (large) & A2 (small) represent convex shapes,
and the bell shapes B1 (large) & B2 (small) represent non-convex shapes with “strong
proximity” in the neck area. For the choice of $\kappa$, a scenario similar to the one in the
F/G separation experiments is used. When $\kappa$ is small, we pick shapes by convexity, and when
$\kappa$ is larger, we pick either the smaller shapes or the bell shape with stronger
proximity. The details are shown in Fig. 7.

Convexity Comparison of Shapes with the Same Size. When two shapes share the same
size, we can easily pick the ellipse as the more convex shape by choosing $\kappa < 0.6$.
A large $\kappa$ favors proximity and therefore picks the bell as the more favorable shape.
We have $S_{A1} < S_{B1}$ and $S_{A2} < S_{B2}$ when $\kappa < 0.6$.

Convexity Comparison of Shapes with Any Sizes. To make the measure applicable
over a wider range, we compare shapes with different sizes, which is achieved by using an
even smaller $\kappa < 0.4$. In this case, no size/proximity issues need to be considered and
the measure becomes a “convexity measure”. We have $S_{A1}, S_{A2} < S_{B1}, S_{B2}$ if $\kappa < 0.4$.
The smallest value, $\kappa = 0.1$, provides the measure with the least sensitivity to
size. As we can see, the values for similar shapes of different sizes coincide
in this situation.
Table 1. Kanizsa figures with various size ratios (Fig. 6), measured by the orientation diffusion and
the PGR model [16]. The series from (a1) to (a3) gives the white region area from small to large.
Series (a1) has balanced size. For the orientation diffusion, a smaller $\kappa$ gives a purer convexity
measure, and a larger $\kappa$, meaning more uncertainty in orientation in the walks, causes a stronger
size preference. The prediction for F/G is likely to be decided by convexity when $\kappa$ is small and
by size/proximity when $\kappa$ is large. We can get convex predictions for F/G when $\kappa \le 0.1$.
The measure collected by the PGR model (the result for (a3) is shown in Fig. 6(c3)) is essentially a
size measure if we do not provide a size-balanced experiment.

Series (Area)       $\kappa = 0.1$    $\kappa = 0.4$    $\kappa = \pi/2$    Remark

Convexity Measure by Orientation Diffusion (S/S)
(a1) (+0/-0%)       0.487/0.502       0.490/0.499       0.560/0.570         size balanced
(a2) (+6/-0%)       0.488/0.500       0.492/0.497       0.564/0.565
(a3) (+28/-0%)      0.490/0.497       0.495/0.491       0.573/0.555

Convexity Measure by PGR model (S/S)
(a1) (+0/-0%)       0.621/0.653                                             size balanced
(a2) (+6/-0%)       0.638/0.635
(a3) (+28/-0%)      0.671/0.597

Large $\kappa$ and Size/Proximity Preference. A large $\kappa$ corresponds to diffusion with
low certainty in orientation. The result then favors smaller shapes, as
we have $S_{A2}, S_{B2} < S_{A1}, S_{B1}$. When shapes of similar size are compared, a shape with
stronger proximity prevails. Therefore, the bell B1 is favored over the ellipse A1 and
B2 is favored over A2, by having smaller measures.



5 Comparison of Convex Shapes 

5.1 From Rectangle to Square 

In Fig. 8, we run the model on a series of rectangles with the ratio of side lengths ranging from 0
to 1. As shown in the experiments, the closer the rectangle is to a square, the better the
convexity measure. The strong symmetry in the thin rectangles is cancelled by the
computation of Eq. 11.



5.2 From Triangle to Circle 

In Fig. 9, we examine various regular shapes. Unlike the model provided by [16], which gives
a measure with a preference ordered from triangle to circle, this orientation model gives the
reverse order: it prefers the circle most, then the hexagon, then the triangle. With the Kullback-
Leibler measure, diffusion from neighboring inducers with similar orientations
accumulates around $\theta^*$, the direction of the resulting vector, hence giving a good score. As a
cross-check, the square in Fig. 8(d) has $S_{\text{hexagon}} < S_{\text{square}} < S_{\text{triangle}}$, falling between
the triangle and the hexagon. A preference similar to the one given by the PGR model [16]
can be achieved by a simulation with a large $\kappa$.
[Plot: Relative Entropy for Various Shapes. Panels on the right: $\kappa = 0.4$: $S_{A1} = 0.432$, $S_{B1} = 0.449$;
$\kappa = \pi/2$: $S_{A1} = 0.650$, $S_{B1} = 0.619$.]

Fig. 7. A comparison between shapes with different sizes and convexity. The four curves, listed at
$\kappa = \pi/2$ from top to bottom, are for the (imperfect) ellipse A1, the bell B1, the small ellipse A2, and
the small bell B2 (shapes A1 and B1 are shown on the right). A measure with small $\kappa$ picks shapes by
convexity. A measure with large $\kappa$ picks shapes by size/proximity; therefore, either small
shapes or the bells, which have strong proximity near the neck, are favored. When $\kappa = 0.1$, the
case with the least size preference, the measures for all similar shapes with different sizes
coincide. The level sets of the relative entropy for A1 and B1 with the simulations
for $\kappa = 0.4$ and $\kappa = \pi/2$ are shown on the right-hand side. The preference between the ellipse and
the bell switches as we go from a small $\kappa$ to a large $\kappa$.



5.3 Convexity Is Different from Shape Compactness 

It is interesting to compare our measure with the compactness CP, generally given
by $CP(\partial\Omega) = |\partial\Omega|^2 / (4\pi \cdot |\Omega|)$. As one can easily verify, the same preference for
regular shapes and the preference of squares over rectangles are also established by the
compactness CP. However, the two measures differ in several respects:

1. Unlike the shape convexity, the compactness favors a region with large area
over one with small area, given regions with the same perimeter. However, the small-
size preference does exist in F/G selection, and it can be simulated by our convexity
measure (Fig. 5, Fig. 6 and Tab. 1).

2. A circle-like shape with fine-scale perturbation may have larger compactness than a
smooth-boundary, non-circular shape (Young et al. [21], Bribiesca [2], Fig. 10 (a)
& (b)).

3. Our measure has the flexibility to switch between a size-invariant convexity measure
and one with a small-size preference. It also offers the freedom to choose the preference of
circles over triangles (small $\kappa$) or triangles over circles (large $\kappa$), etc.

4. It is hard to generalize the shape compactness to applications where the shape
is partial or occluded.
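The compactness formula above is easy to evaluate for ideal shapes. The short sketch below (illustrative function and variable names, closed-form perimeters and areas for unit-sized shapes) shows that CP orders circle, hexagon, square and triangle in the same way as the convexity measure with small $\kappa$, and that it also prefers a square to an elongated rectangle.

```python
import math

def compactness(perimeter, area):
    """CP = |boundary|^2 / (4 * pi * |region|); equals 1 for a disc."""
    return perimeter ** 2 / (4.0 * math.pi * area)

# Closed-form perimeter and area for ideal shapes (side length or radius 1).
cp_circle = compactness(2.0 * math.pi, math.pi)
cp_hexagon = compactness(6.0, 3.0 * math.sqrt(3.0) / 2.0)
cp_square = compactness(4.0, 1.0)
cp_triangle = compactness(3.0, math.sqrt(3.0) / 4.0)
cp_rectangle = compactness(2.0 * (1.0 + 4.0), 4.0)  # 1-by-4 rectangle

for name, value in [("circle", cp_circle), ("hexagon", cp_hexagon),
                    ("square", cp_square), ("triangle", cp_triangle),
                    ("rectangle 1x4", cp_rectangle)]:
    print(f"{name:14s} CP = {value:.3f}")
```

Because CP depends only on perimeter and area, it cannot be tuned between size invariance and a small-size preference, which is the flexibility claimed for the orientation model in item 3.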
(a) (b) S = 0.437 (c) S = 0.420 (d) S = 0.417

Fig. 8. (a) Rectangle with side lengths $l_1$ and $l_2$. (b) (c) (d) Samples of different ratios $l_2/l_1$ and the relative
entropy $D(P_x \| Q)$. Their measures are shown, given the parameter $\kappa = 0.1$.

(a) S = 0.422 (b) S = 0.410 (c) S = 0.407

Fig. 9. The relative entropy or corresponding iso-contours derived by our orientation process. Their
measures are shown, with the parameter set to $\kappa = 0.1$. Note that the square in Fig. 8(d) falls
between the triangle and the hexagon. The area information is (a) triangle: 897, (b) hexagon:
1361, and (c) circle: 1469 pixels.



Some examples illustrating the difference between the compactness and the convexity
are shown in Fig. 10. For the shape (a), perturbations at the fine scale cause a smaller
change in the convexity measure than in the compactness.



6 Symmetry Information 



The idea that symmetry can be captured by our model is not particularly new or surprising.
For example, based on the work of Siddiqi and Kimia [19], we know that the diffusion
process over shapes will “meet” (yield shocks) at the symmetry axis. What is particular
to our work is that the diffusion is on the space $\Omega \times \Theta$ and is represented in vector form,
so the symmetry information can be extracted by a local computation.

Let us collect the maximum vectors $\sigma_M(x)$ by

$$\sigma_M(x) = \sigma^*(x;\theta_M)\, u_M \quad \text{where} \quad \sigma^*(x;\theta_M) = \max_{\theta \in [0, 2\pi)} \{ \sigma^*(x;\theta) \},$$

with $u_M = (\cos\theta_M, \sin\theta_M)$. We would like to examine the field



$$\sigma_{\mathrm{sym}} = \ldots \qquad (14)$$



The shape axis (different from the conventional symmetry axis [15]) is viewed as the
place with the most sinkage in the normalized map of the maximum vectors. The map shows
“how likely” a point is to lie on an axis, or “how strong” the axis is. Examples of shape
axes can be found in Fig. 11.
Fig. 10. Convexity measure for various shapes with different convexity at different scales, from
the most convex to the least convex one ($\kappa = 0.1$).

Fig. 11. Shape axis $\sigma_{\mathrm{sym}}$ (Eq. 14), computed by using the maximum vector $\sigma_M$, with $\kappa = \pi/2$.






6.1 Protrusion 

To see how protrusion or noise on the boundaries can affect the shape axis, we study the
axes of the shapes in Fig. 12. In (b1), the main axis stays the same as we perturb the
boundaries of (a) by a small protrusion. But a larger perturbation can (b2) eventually produce
a new axis or (b3) even change the location of the main axis. With this representation,
further pruning of the “sub-axes” is not necessary, because sub-axes are given in
“lighter” representations and can be easily removed.

7 Conclusion 

We have proposed a framework to capture critical shape information, namely convexity,
symmetry and size. It is an orientation diffusion approach, rooted in Markov random
fields. The focus of the paper was on demonstrating how the model evaluates convexity,
and how size is integrated into the model depending upon a parameter $\kappa$. Although it is
not surprising, we have indicated how symmetry can also be captured. The protrusion
effect on boundaries is handled naturally by this approach. We have demonstrated the
validity of the model on various examples and shown how it improves on previous work. We
believe that this approach provides a wealth of information on shapes and thus can be
of use to other practitioners.
(a) (b1) (b2) (b3)

Fig. 12. Shape axis with a protrusion on the boundaries. The transition goes from (b1), where the
protrusion is ignored, through (b2), which has a “sub-axis”, to (b3), where the main axis is disturbed
by a fraction. We choose the maximum vector and select the deviation factor $\kappa = \pi/2$.
Abstract. We address the problem of finding a set of contour curves in
an image. We consider the problem of perceptual grouping and contour 
completion, where the data is a set of points in the image. A new method 
to find complete curves from a set of contours or edge points is presented. 
Our approach is an extension of previous work on finding a set of contours 
as minimal paths between end points using the fast marching algorithm. 
Given a set of key points, we find the pairs of points that have to be 
linked and the paths that join them. We use the saddle points of the 
minimal action map. The paths are obtained by backpropagation from 
the saddle points to both points of each pair. 

We also propose an extension of this method for contour completion 
where the data is a set of connected components. We find the minimal 
paths between each of these components, until the complete set of these 
“regions” is connected. The paths are obtained using the same backprop- 
agation from the saddle points to both components. 

Keywords: Perceptual grouping, salient curve detection, active con- 
tours, minimal paths, fast marching, level sets, weighted distance, recon- 
struction, energy minimization, medical imaging. 



1 Introduction 

We are interested in perceptual grouping and finding a set of curves in an image 
with the use of energy minimizing curves. 

Since their introduction, active contours [12] have been extensively used to
find the contour of an object in an image through the minimization of an energy.
In order to get a set of contours of different objects, we need many active contours
to be initialized on the image. The level sets paradigm [15,1] allowed changes
in topology. It makes it possible to obtain multiple contours by starting with a single one.
However, these methods do not give satisfying results when there are gaps in the data,
since the contour may propagate into a hole and then split into many curves where
only one contour is desired. This is the problem encountered in perceptual
grouping, where a set of incomplete contours is given. For example, in a binary
image like the ones in figure 1, with a drawing of a shape with holes and spurious
edge points, human vision can easily fill in the missing boundaries, remove the
spurious ones, and form complete curves. Perceptual grouping is an old problem
in computer vision. It has been approached more recently with energy methods
[17,11,18]. These methods find a criterion for the saliency of a curve component or
of each point of the image. In these methods, the definition of the saliency measure
is based indirectly on a second-order regularization snake-like energy ([12]) of a
path containing the point. However, the final curves are generally obtained in
a second step as ridge lines of the saliency criterion after thresholding. In [19] a
similarity between snakes and stochastic completion fields is reported. Motivated
by this relation between energy minimizing curves like snakes and completion
contours, we are interested in finding a set of completion contours on an image
as a set of energy minimizing curves.

In order to solve the global minimization problem for snakes, the authors of [6] used
minimal paths, as introduced in [14,13]. The goal was to avoid local minima
without demanding too much of the user initialization, which is a main drawback of
classic snakes [4]. Only two end points were needed. The numerical method has
the advantage of being consistent (see [6]) and efficient, using the Fast Marching
algorithm introduced in [16]. In this paper we propose a way to use this minimal
path approach to find a set of curves drawn between points in the image. In
order to find a set of the most salient contour curves in the image, we draw the
minimal paths between pairs of points.

We are also interested in finding a set of curves in an image between a set 
of connected components. This extension of our perceptual grouping technique 
finds its application in the completion of tube-like structures in images. The 
problem is here to complete a partially detected object, based on a number of 
connected components that belong to this object. 

In our examples, the potential $P$ to be minimized along the curves is usually
an image of edge points that represent simple incomplete shapes. These edge
points are represented as a binary image with small potential values along the
edges and high values in the background. Such a potential can be obtained from
real images by edge detection (see [5]). The potential could also be defined as
edges weighted by the value of the gradient, or as a function of an estimate of the
gradient of the image itself, $P = g(\|\nabla I\|)$, as in classic snakes. In these cases
the chosen function has to be such that the potential is positive everywhere, and
it has to be decreasing in order to have edge points as minima of the potential.
The potential could also be a grey level image as in [6]. It can also be a more
complicated function of the grey level, as in [7] or in Section 4.4.
The problems we solve in this paper are presented as follows:

- Minimal path between two points: the solution proposed in [6] is reviewed
in Section 2.

- Minimal paths between a given set of pairs of points: this is a simple application
of the previous one.

- Minimal paths between a given set of unstructured points: a way to find the
pairs of linked neighbors and the paths between them is proposed in Section 3.

- Minimal paths between a given set of connected components: we propose to
find the set of minimal paths that link this set of regions together, and we
show an example for a medical image in Section 4.

- Minimal paths between an unknown set of points: in [2] we propose the
automatic finding of key points among a larger set of admissible points and
the drawing of minimal paths that lead to completed curves.

2 Minimal Paths and Weighted Distance 

2.1 Global Minimum for Active Contours 

We present in this section the basic ideas of the method introduced in [6] to
find the global minimum of the active contour energy using minimal paths. The
energy to minimize is similar to that of classical deformable models (see [12]), where it
combines smoothing terms and an image-feature attraction term (potential $P$):

$$E(C) = \int_\Omega \left\{ w_1 \|C'(s)\|^2 + w_2 \|C''(s)\|^2 + P(C(s)) \right\} ds \qquad (1)$$

where $C(s)$ represents a curve drawn on a 2D image and $\Omega$ is its domain of
definition. The authors of [6] have related this problem to the recently introduced
paradigm of the level-set formulation. In particular, its Euler equation
is equivalent to the geodesic active contours [1]. The method introduced in [6]
improves energy minimization because the problem is transformed in a way that
allows the global minimum to be found.



2.2 Problem Formulation 

In [6], contrary to the classical snake model (but similarly to geodesic active
contours), $s$ represents the arc-length parameter, which means that $\|C'(s)\| = 1$,
leading to a geometric energy form. Considering a simplified energy model
without the second derivative term leads to the expression $E(C) = \int \{ w \|C'\|^2 +
P(C) \}\, ds$. Assuming that $\|C'(s)\| = 1$ leads to the formulation

$$E(C) = \int_{\Omega = [0, L]} \left\{ w + P(C(s)) \right\} ds \qquad (2)$$

The regularization of this model is now achieved by the constant $w > 0$ (see [6]
for details).
Fig. 2. Finding a minimal path between two points. On the left, the potential is minimal
on the ellipse. In the middle, the minimal action or weighted distance to the marked
point. On the right, the minimal path obtained by backpropagation from the second point.



We now have an expression in which the internal energy can be included in
the external potential. Given a potential $P > 0$ that takes lower values near
desired features, we are looking for paths along which the integral of $\tilde{P} = P + w$
is minimal. The surface of minimal action $U$ is defined as the minimal energy
integrated along a path between a starting point $p_0$ and any point $p$:

$$U(p) = \inf_{\mathcal{A}_{p_0,p}} E(C) = \inf_{\mathcal{A}_{p_0,p}} \left\{ \int_\Omega \tilde{P}(C(s))\, ds \right\} \qquad (3)$$

where $\mathcal{A}_{p_0,p}$ is the set of all paths between $p_0$ and $p$. The minimal path between
$p_0$ and any point $p_1$ in the image can be easily deduced from this action map.
Assuming that the potential $P \ge 0$ (this is always the case for $\tilde{P}$), the action map
has only one local minimum, which is the starting point $p_0$. The minimal path is
found by a simple back-propagation, that is, a gradient descent on the minimal
action map $U$ starting from $p_1$ until $p_0$ is reached. Thus, contour initialization
is reduced to the selection of the two extremities of the path. We explain in the
next section how to compute the action map $U$ efficiently.



2.3 Fast Marching Resolution 



In order to compute this map $U$, a front-propagation equation related to
Equation (3) is solved:

$$\frac{\partial C}{\partial t} = \frac{1}{\tilde{P}}\, \vec{n} \qquad (4)$$

It evolves a front starting from an infinitesimal circle around $p_0$ until each
point inside the image domain is assigned a value for $U$. The value of $U(p)$ is the
time $t$ at which the front passes over the point $p$.

The Fast Marching technique, introduced in [16], was used in [6], noticing
that the map $U$ satisfies the Eikonal equation:

$$\|\nabla U\| = \tilde{P} \quad \text{and} \quad U(p_0) = 0. \qquad (5)$$



Classic finite-difference schemes for this equation tend to overshoot and are
unstable. An up-wind scheme was proposed by [16]. It relies on a one-sided
derivative that looks in the up-wind direction of the moving front, and thereby
avoids the overshooting associated with finite differences:

$$\left( \max\{ u - \mathcal{U}_{i-1,j},\, u - \mathcal{U}_{i+1,j},\, 0 \} \right)^2 + \left( \max\{ u - \mathcal{U}_{i,j-1},\, u - \mathcal{U}_{i,j+1},\, 0 \} \right)^2 = \tilde{P}_{i,j}^2 \qquad (6)$$

giving the correct viscosity solution $u$ for $U_{i,j}$. The improvement made by the
Fast Marching is to introduce order in the selection of the grid points. This order
is based on the fact that information is propagating outward, because the action
can only grow due to the quadratic Equation (6).

Laurent D. Cohen and Thomas Deschamps

Table 1. Fast Marching algorithm

Algorithm for 2D Fast Marching

- Definitions:

• Alive set: all grid points at which the action value $\mathcal{U}$ has been reached
and will not be changed;

• Trial set: next grid points (4-connexity neighbors) to be examined.
An estimate $U$ of $\mathcal{U}$ has been computed using Equation (6) from alive
points only (i.e. from $\mathcal{U}$);

• Far set: all other grid points, for which there is not yet an estimate of $U$;

- Initialization:

• Alive set: reduced to the starting point $p_0$, with $U(p_0) = \mathcal{U}(p_0) = 0$;

• Trial set: reduced to the four neighbors $p$ of $p_0$ with initial value
$U(p) = \tilde{P}(p)$ ($\mathcal{U}(p) = \infty$);

• Far set: all other grid points, with $\mathcal{U} = U = \infty$;

- Loop:

• Let $p = (i_{min}, j_{min})$ be the Trial point with the smallest action $U$;

• Move it from the Trial to the Alive set (i.e. $\mathcal{U}(p) = U_{i_{min},j_{min}}$ is
frozen);

• For each neighbor $(i,j)$ (4-connexity in 2D) of $(i_{min}, j_{min})$:

* If $(i,j)$ is Far, add it to the Trial set and compute $U_{i,j}$ using Eqn. 6;

* If $(i,j)$ is Trial, update the action $U_{i,j}$ using Eqn. 6.

This technique of considering at each step only the necessary set of grid
points was originally introduced for the construction of minimum length paths
in a graph between two given nodes in [8].

The algorithm is detailed in Table 1. An example is shown in Figure 2. The
Fast Marching technique selects at each iteration the Trial point with minimum
action value. In order to compute this value, we have to solve Equation (7) for
each trial point, as detailed in Table 2.



2.4 Algorithm for 2D Up-wind Scheme 

Notice that for solving Equation (6), only alive points are considered. This
means that the calculation is made using current values of $\mathcal{U}$ for neighbors and
not the estimates $U$ of other trial points. Considering the neighbors of grid point
$(i,j)$ in 4-connexity, we denote by $\{A_1, A_2\}$ and $\{B_1, B_2\}$ the two couples of opposite
neighbors such that we get the ordering $\mathcal{U}(A_1) \le \mathcal{U}(A_2)$, $\mathcal{U}(B_1) \le \mathcal{U}(B_2)$, and
$\mathcal{U}(A_1) \le \mathcal{U}(B_1)$. Considering that we have $u \ge \mathcal{U}(B_1) \ge \mathcal{U}(A_1)$, the equation
derived is

$$(u - \mathcal{U}(A_1))^2 + (u - \mathcal{U}(B_1))^2 = \tilde{P}_{i,j}^2 \qquad (7)$$

Computing the discriminant $\Delta$ of Equation (7), we have the steps described in
Table 2.



Table 2. Solving locally the upwind scheme 



1. - If $\Delta \ge 0$, $u$ should be the largest solution of Equation (7);

       • If the hypothesis $u > U(B_1)$ is wrong, go to 2;

       • If this value is larger than $U(B_1)$, this is the solution;

   - If $\Delta < 0$, $B_1$ has an action too large to influence the solution. It
     means that $u > U(B_1)$ is false. Go to 2;

Simple calculus can replace case 1 by the test:

1bis. If $P_{i,j} > U(B_1) - U(A_1)$,

$$ u = \frac{U(B_1) + U(A_1) + \sqrt{2 P_{i,j}^2 - (U(B_1) - U(A_1))^2}}{2} $$

      is the largest solution of Equation (7);
      else go to 2;

2. Considering that we have $u < U(B_1)$ and $u \ge U(A_1)$, we finally have
   $u = U(A_1) + P_{i,j}$.
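
The local solve of Table 2 fits in a few lines. The Python sketch below is
illustrative only: the function name `solve_upwind` and its argument names are
assumptions, not notation from the paper. It takes the smaller alive value on
each axis, U(A_1) and U(B_1), and the local potential, and expects `math.inf`
for an axis that has no alive neighbor.

```python
import math


def solve_upwind(u_a1, u_b1, p):
    """Solve Equation (7) locally, following cases 1bis and 2 of Table 2.

    u_a1, u_b1: smallest alive action on each axis (math.inf if none).
    p: potential at the grid point, assumed positive.
    """
    # Enforce the ordering U(A_1) <= U(B_1) assumed in the text.
    if u_a1 > u_b1:
        u_a1, u_b1 = u_b1, u_a1
    diff = u_b1 - u_a1
    if p > diff:
        # Case 1bis: both neighbors contribute; take the largest root of (7).
        return 0.5 * (u_a1 + u_b1 + math.sqrt(2.0 * p * p - diff * diff))
    # Case 2: B_1 is too large to influence the solution.
    return u_a1 + p
```

With two equal neighbors the front arrives diagonally, so the returned value
exceeds the neighbors by P/sqrt(2) rather than by P.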



Thus it needs only one pass over the image. To perform these operations
efficiently, the Trial points are stored in a min-heap data structure (see
details in [16]). Since changing the value of one element of the heap costs at
most a worst-case bottom-to-top traversal of the tree, in $O(\log_2 N)$, the
total work is bounded by $O(N \log_2 N)$ for the Fast Marching on a grid with
$N$ nodes.
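
As a rough illustration of the heap-driven loop, here is a minimal Python
sketch of Fast Marching on a 4-connected grid. It is not the authors'
implementation: the name `fast_marching`, the list-of-lists grid layout, and
the use of `heapq` with lazy deletion of stale entries (instead of changing
heap entries in place) are all assumptions made for brevity.

```python
import heapq
import math


def fast_marching(potential, seeds):
    """Return the minimal action map for `potential` from the `seeds` points."""
    rows, cols = len(potential), len(potential[0])
    action = [[math.inf] * cols for _ in range(rows)]
    alive = [[False] * cols for _ in range(rows)]
    heap = []
    for i, j in seeds:
        action[i][j] = 0.0
        heapq.heappush(heap, (0.0, i, j))

    def alive_min(cells):
        # Smallest action among the alive cells of one axis (inf if none).
        best = math.inf
        for i, j in cells:
            if 0 <= i < rows and 0 <= j < cols and alive[i][j]:
                best = min(best, action[i][j])
        return best

    def local_update(i, j):
        a = alive_min(((i - 1, j), (i + 1, j)))
        b = alive_min(((i, j - 1), (i, j + 1)))
        a, b = min(a, b), max(a, b)
        p = potential[i][j]
        if p > b - a:  # both neighbors are used
            return 0.5 * (a + b + math.sqrt(2.0 * p * p - (b - a) ** 2))
        return a + p   # only the smaller neighbor is used

    while heap:
        _, i, j = heapq.heappop(heap)
        if alive[i][j]:
            continue  # stale heap entry for a point that is already alive
        alive[i][j] = True
        for ni, nj in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
            if 0 <= ni < rows and 0 <= nj < cols and not alive[ni][nj]:
                new_u = local_update(ni, nj)
                if new_u < action[ni][nj]:
                    action[ni][nj] = new_u
                    heapq.heappush(heap, (new_u, ni, nj))
    return action
```

Lazy deletion keeps the code short at the cost of a somewhat larger heap; each
push and pop still costs a logarithmic number of operations.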

3 Finding Multiple Contours from a Set of Key Points pk 

The method of [6], detailed in the previous section, finds a minimal path
between two endpoints. We are now interested in finding many or all contours
in an image. A first step towards finding multiple contours is to assume that
a set of points is given on the image and then to find contours passing
through these points. We assume the points are either given by a preprocessing
or by the user. We propose to find the contours as a set of minimal paths that
link pairs of points among the $p_k$'s. If we also know which pairs of points
have to be linked together, finding the whole set of contours is a trivial
application of the previous section. This would be similar to the method in
[10], which used a dynamic programming (non consistent, see [6]) approach to
find the paths between successive points given by the user. The problem we are
interested in here is also to find out which pairs of points have to be
connected by a contour. Since the set of points $p_k$ is assumed to be given
unstructured, we do not know in advance how the points connect. This is the
key problem that is solved here using a minimal action map.

3.1 Main Ideas of the Approach 

Our approach is similar to computing the distance map to a set of points and
their Voronoi diagram. However, we use here a weighted distance defined through
the potential $P$. This distance is obtained as the minimal action with respect
to $P$ with zero value at all points $p_k$. Instead of computing a minimal
action map for each pair of points, as in Section 2, we only need to compute
one minimal action map in order to find all paths. While the action map is
being computed, we also determine the pairs of points that have to be linked
together. This is based on finding meeting points of the propagation fronts.
These are saddle points of the minimal action $U$. In Section 2, we said that
calculation of the minimal action can be seen as the propagation of a front
through equation (4). Although the minimal action is computed using fast
marching, the level sets of $U$ give the evolution of the front. During the
fast marching algorithm, the boundary of the set of alive points also gives
the position of the front. In the previous section, we had only one front
evolving from the starting point $p_0$. Since all points $p_k$ are set with
$U(p_k) = 0$, we now have one front evolving from each of the starting points
$p_k$. In what follows, when we talk about front meeting, we mean either the
geometric point where the two fronts coming from different $p_k$'s meet, or, in
the discrete algorithm, the first alive point which connects two components
from different $p_k$'s (see Figures 3 and 4).

Our problem is related to the approach presented at the end of [6] in order to
find a closed contour. Given only one end point, the second end point was found
as a saddle point. This point is where the two fronts propagating both ways
meet. Here we use the fact that given two end points $p_1$ and $p_2$, the
saddle point $S$ where the two fronts starting from each point meet can be used
to find the minimal path between $p_1$ and $p_2$. Indeed, the minimal path
between the two points has to pass through the meeting point $S$. This point is
the point half way (in energy) on the minimal path between $p_1$ and $p_2$.
Backpropagating from $S$ to $p_1$ and then from $S$ to $p_2$ gives the two
halves of the path. This is in fact an approximation, due to some
discretization error in finding the meeting point $S$. If high precision is
needed, a subpixel location of saddle points can be computed from the final
energy map. In order to get the precise minimal path between the two points,
we could also backpropagate from the second point to the first as in Section 2,
but the computation time would then be much greater.

3.2 Some Definitions 

Here are some definitions that will be used in what follows. 

- For a point $p$ in the image, we note $U_p$ the minimal action obtained by
  Fast Marching with potential $P$ and starting point $p$.

Fig. 3. Ellipse example with four points. On the left the incomplete ellipse as potential
and four given points; on the right the minimal action map (random LUT to show the
level sets) from these points



- $X$ being a set of points in the image, $U_X$ is the minimal action obtained
  by Fast Marching with potential $P$ and starting points $\{p, p \in X\}$.
  This means that all points of $X$ are initialized as alive points with value
  0 and all their 4-connexity neighbors are trial points. It is easy to see
  that $U_X = \min_{p \in X} U_p$.

- The region $R_k$ associated with a point $p_k$ is the set of points $p$ of
  the image closer in energy to $p_k$ than to other points $p_j$. This means
  that minimal action $U_{p_k}(p) \le U_{p_j}(p), \forall j \ne k$. Thus, if
  $X = \{p_j, 0 \le j \le N\}$, we have $U_X = U_{p_k}$ on $R_k$ and the
  computation of $U_X$ is the same as the simultaneous computation of each
  $U_{p_k}$ on each region $R_k$. These are the simultaneous fronts starting
  from each $p_k$.

- The region index $r$ is $r(p) = k, \forall p \in R_k$ (Voronoi diagram for
  weighted distance).

- A saddle point $S(p_i,p_j)$ between $p_i$ and $p_j$ is the first point where
  the front starting from $p_i$ to compute $U_{p_i}$ meets the front starting
  from $p_j$ to compute $U_{p_j}$. At this point, $U_{p_i}$ and $U_{p_j}$ are
  equal and this is the smallest value for which they are equal.

- Two points among the $p_k$'s will be called linked neighbors if they are
  selected to be linked together. The way we choose to link two points is to
  select some saddle points. Thus points $p_i$ and $p_j$ are linked neighbors
  if their saddle point is among the selected ones.



3.3 Saddle Points and Reconstruction of the Set of Curves 

Fig. 4. Zoom on a saddle point between two key points

The main goal of our method is to obtain all significant paths joining the
given points. However, each point should not be connected to all other points,
but only to those that are closest to it in the energy sense. In order to form
closed curves, each point $p_k$ should not have more than two linked neighbors.
The criterion for two points $p_i$ and $p_j$ to be connected is that their
fronts meet before other fronts. It means that their saddle point $S(p_i,p_j)$
has lower action $U$ than the saddle points between these points and other
points $p_k$. Because we limit each $p_k$ to no more than two connections, some
points may have only one connection or none. This helps to remove isolated
spurious points and to obtain different closed curves that are not connected
together. We illustrate this in the example of Figure 6, where one of the $p_k$
is not linked to any other point since all the other points already have two
linked neighbors. In case we also need to have T-junctions, the algorithm can
be used with a higher number of linked neighbors allowed for each endpoint, or
we may connect together all possible points as in Section 4.3. A non-symmetric
relation may also be used to link each point to the closest or the two closest
ones, regardless of whether these already have two or more neighbors. In the
example of Figure 6, such an approach would link the spurious points with the
circles. Postprocessing would then be needed to remove undesired links, based
on high energy for example.
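
The linking rule described above (fronts that meet first are linked first, with
a cap on the number of linked neighbors) can be sketched as a greedy selection.
This is only an illustration under assumptions: the function name
`select_links` and the `(action, k, l)` tuple layout are hypothetical, and the
explicit sort stands in for the order in which Fast Marching would discover the
saddle points anyway.

```python
def select_links(saddles, max_neighbors=2):
    """Pick linked neighbors from saddle points, lowest action first.

    saddles: iterable of (action, k, l), one entry per pair of key points
             whose fronts meet, with the action value at the meeting point.
    Returns the list of selected pairs (k, l).
    """
    degree = {}    # number of linked neighbors already given to each point
    selected = []
    for _, k, l in sorted(saddles):
        if degree.get(k, 0) < max_neighbors and degree.get(l, 0) < max_neighbors:
            degree[k] = degree.get(k, 0) + 1
            degree[l] = degree.get(l, 0) + 1
            selected.append((k, l))
    return selected
```

Raising `max_neighbors` corresponds to allowing T-junctions; a point whose
candidate partners are all saturated stays unlinked, as for the isolated point
of Figure 6.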

Once a saddle point $S(p_i,p_j)$ is found and selected, backpropagation
relative to the final energy $U$ should be done both ways, to $p_i$ and to
$p_j$, to find the two halves of the path between them. Figure 5 shows this
backpropagation at each of the four saddle points. At a saddle point, the
gradient is zero, but the directions of descent towards each point are
opposite. For each backpropagation, the direction of descent is the one
relative to each region. This means that in order to estimate the gradient
direction toward $p_i$, all points in a region different from $R_i$ have their
energy artificially set to $\infty$. This allows finding the correct direction
for the gradient descent towards $p_i$. However, as mentioned earlier, these
backpropagations have to be done only for selected saddle points. In the fast
marching algorithm we have a simple way to find saddle points and update the
linked neighbors.
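
The masking trick for backpropagation can be illustrated with a discrete
descent. The sketch below is a coarse stand-in for the gradient descent used in
the paper: it only steps between 4-connected pixels, and the names `backtrack`,
`action`, and `region` are assumptions. Pixels outside region k are treated as
having infinite energy, so the descent from the saddle point is forced towards
the key point of region k.

```python
import math


def backtrack(action, region, start, k):
    """Descend from `start` (a saddle point) to the key point of region k.

    action: 2D list with the final minimal action U.
    region: 2D list with the region index r(p) of each pixel.
    Returns the list of visited pixels, starting with `start`.
    """
    rows, cols = len(action), len(action[0])

    def masked(i, j):
        # Energy seen by the descent: infinite outside region k.
        if 0 <= i < rows and 0 <= j < cols and region[i][j] == k:
            return action[i][j]
        return math.inf

    i, j = start
    current = action[i][j]  # the start is kept whatever its own region index
    path = [start]
    while current > 0:
        best = min((masked(i + di, j + dj), i + di, j + dj)
                   for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)))
        if best[0] >= current:
            break  # no lower neighbor inside region k
        current, i, j = best
        path.append((i, j))
    return path
```

Calling it twice from the same saddle point, once per region index, gives the
two halves of the path.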

Fig. 5. Ellipse example with four points. On the left the saddle points are found, and
backpropagation is made from them to each of the two points from where the front
comes; on the right, the minimal paths and the Voronoi diagram obtained

As defined above, the region $R_k$ associated with a point $p_k$ is the set of
points $p$ of the image such that minimal energy $U_{p_k}(p)$ to $p_k$ is
smaller than all the $U_{p_j}(p)$ to other points $p_j$. The set of such
regions $R_k$ covers the whole image, and forms the Voronoi diagram of the
image (see Figure 5). All saddle points are at a boundary between two regions.
For a point $p$ on the boundary between $R_j$ and $R_k$, we have
$U_{p_k}(p) = U_{p_j}(p)$. The saddle point $S(p_k,p_j)$ is a point on this
boundary with minimal value of $U_{p_k}(p) = U_{p_j}(p)$. This gives us a rule
to find the saddle points during the fast marching algorithm.

Each time two fronts coming from $p_k$ and $p_j$ meet for the first time, we
define the meeting point as $S(p_k,p_j)$. This means that we need to know, for
each point of the image, which front it comes from. It is easy to keep track
of its origin by generating an index map, updated each time a point is set as
alive in the algorithm. Each point $p_k$ starts with index $k$. Each time a
point is set as alive, it gets the same index as the points it was computed
from in formula (6). In that formula, the computation of $U_{i,j}$ depends on
at most two of the four pixels involved. These two pixels, say $A_1$ and
$B_1$, have to be from the same region, except if $(i,j)$ is on the boundary
between two regions. If $A_1$ and $B_1$ are both alive and with different
indexes $k$ and $l$, this means that regions $R_k$ and $R_l$ meet there. If
this happens for the first time, the current point is set as the saddle point
$S(p_k,p_l)$ between these regions. A point on the boundary between $R_k$ and
$R_l$ is given the index of the neighbor point with smaller action, $A_1$. At
the boundary between two regions there can be a slight error on indexing. This
error of at most one pixel is not important in our context and could be refined
if necessary.
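
The index map and the detection of meeting points can be sketched as follows.
This is a simplified illustration, not the authors' code: it records a meeting
as soon as a newly alive point touches an alive 4-neighbor with a different
index (the discrete definition given in Section 3.1) instead of testing only
$A_1$ and $B_1$, and the names `march_with_regions`, `region`, and `saddles`
are assumptions.

```python
import heapq
import math

NEIGHBORS = ((-1, 0), (1, 0), (0, -1), (0, 1))


def march_with_regions(potential, key_points):
    """Fast Marching from all key points; returns (action, region, saddles)."""
    rows, cols = len(potential), len(potential[0])
    action = [[math.inf] * cols for _ in range(rows)]
    region = [[-1] * cols for _ in range(rows)]
    alive = [[False] * cols for _ in range(rows)]
    saddles = {}  # (k, l) with k < l -> first meeting point of regions k and l
    heap = []
    for k, (i, j) in enumerate(key_points):
        action[i][j], region[i][j] = 0.0, k
        heapq.heappush(heap, (0.0, i, j))

    def inside(i, j):
        return 0 <= i < rows and 0 <= j < cols

    def smallest_alive(cells):
        # (action, index) of the smallest alive cell on one axis.
        best = (math.inf, -1)
        for i, j in cells:
            if inside(i, j) and alive[i][j]:
                best = min(best, (action[i][j], region[i][j]))
        return best

    def update(i, j):
        (ua, ra), (ub, _) = sorted((smallest_alive(((i - 1, j), (i + 1, j))),
                                    smallest_alive(((i, j - 1), (i, j + 1)))))
        p = potential[i][j]
        if p > ub - ua:
            return 0.5 * (ua + ub + math.sqrt(2.0 * p * p - (ub - ua) ** 2)), ra
        return ua + p, ra  # the index is inherited from A_1 in both cases

    while heap:
        _, i, j = heapq.heappop(heap)
        if alive[i][j]:
            continue  # stale heap entry
        alive[i][j] = True
        for di, dj in NEIGHBORS:
            ni, nj = i + di, j + dj
            if not inside(ni, nj):
                continue
            if alive[ni][nj]:
                pair = tuple(sorted((region[i][j], region[ni][nj])))
                if pair[0] != pair[1] and pair not in saddles:
                    saddles[pair] = (i, j)  # first meeting of the two fronts
            else:
                new_u, label = update(ni, nj)
                if new_u < action[ni][nj]:
                    action[ni][nj], region[ni][nj] = new_u, label
                    heapq.heappush(heap, (new_u, ni, nj))
    return action, region, saddles
```

Because points become alive in increasing order of action, the first recorded
meeting for a pair of regions is also the one with the smallest action on their
common boundary, up to the one-pixel indexing error mentioned above.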

3.4 Algorithm and Results 

The algorithm for this section is described in Table 3 and illustrated in
Figures 3 and 5.

Table 3. Algorithm of Section 3

Algorithm with previously defined $p_k$

- Initialization:

    • $p_k$'s are given

    • $\forall k$, $V(p_k) = 0$; $r(p_k) = k$; $p_k$ alive.

    • $\forall p \notin \{p_k\}$, $V(p) = \infty$; $r(p) = -1$; $p$ is far,
      except 4-connexity neighbors of $p_k$'s that are trial with estimate $U$
      using Eqn. (6).

- Loop for computing $V = U_{\{p_k,\, 0 \le k \le N\}}$:

    • Let $p = (i_{min}, j_{min})$ be the Trial point with the smallest
      action $U$;

    • Move it from the Trial to the Alive set with $V(p) = U(p)$;

    • Update $r(p)$ with the same index as point $A_1$ in formula (6). If
      $r(A_1) \ne r(B_1)$ and we are in case 1 of Table 2 where both points
      are used and if this is the first time regions $r(A_1)$ and $r(B_1)$
      meet, $S(p_{r(A_1)}, p_{r(B_1)}) = p$ is set as a saddle point between
      $p_{r(A_1)}$ and $p_{r(B_1)}$. If these points do not yet have two
      linked neighbors, they are put as linked neighbors and
      $S(p_{r(A_1)}, p_{r(B_1)}) = p$ is selected;

    • For each neighbor $(i,j)$ (4-connexity) of $(i_{min}, j_{min})$:

        * If $(i,j)$ is Far, add it to the Trial set and compute $U$ using
          Eqn. (6);
        * If $(i,j)$ is Trial, recompute the action $U_{i,j}$, and update it.

- Obtain all paths between selected linked neighbors by backpropagation each
  way from their saddle point (see Section 3.3).

When there is a large number of $p_k$'s, this does not change much the
computation time of the minimal action map, but it makes dealing with the list
of linked neighbors and saddle points more complex. This may generate more
conflicting neighbor points, and due to the constraint of having at most two
linked neighbors, some gaps may remain between contours. The method can be
applied to a whole set of edge points or points obtained through a
preprocessing. However, choosing few key points simplifies the computation of
saddle points and linked neighbors and the geometry of the paths. When there
are few key points, they are not too close to each other. Finding all paths
from a given set of points is interesting in the case of a binary potential
defined, as in Figure 3, for perceptual grouping. It can be used as well when
a special preprocessing is possible, either on the image itself to extract
characteristic points or on the geometry of the initial set of points to
choose more relevant points. In [2] we give a way to find automatically a set
of key points from a larger set of "admissible points".

Fig. 6. Final paths obtained for images of Figure 1

Fig. 7. Finding a set of minimal paths using a saliency map as potential. From left to
right, original data, saliency map, ridge lines and minimal paths

We show in Figure 6 the results of the approach for the data given in Figure 1.
We show in Figure 7 an application of our approach combined with the saliency
map of [11]. In such an example, the given dots are too few to enable finding
the ellipse as a minimal path. Indeed, taking two opposite points on the
ellipse, the minimal path between them will not be along the ellipse but rather
along a straight line. By passing through low potential points (in black) along
the ellipse, the path will also pass through more points with high potential
(background in white). Applying the method of [11] gives a saliency map that is
much denser than the original image. Taking the saliency map as potential, our
approach allows finding the whole ellipse as a set of minimal paths between
points determined automatically. The set of points to be linked here can be
either the initial set of points of the original image or the set of points
obtained by thresholding the saliency map. We can also find the ellipse by
looking for ridge curves on the saliency map, but many spurious ridge curves
are then obtained.

4 Extension to Finding Multiple Contours from a Set 
of Connected Components 

4.1 Minimal Path between Two Regions 

The method of [6], detailed in the previous section, finds a minimal path
between two endpoints. It is straightforward to extend it to define a minimal
path between two regions of the image. Given two connected regions of the image
$R_0$ and $R_1$, we consider $R_0$ as the starting region and $R_1$ as a set of
end points. The problem is then to find a path minimizing energy among all
paths with start point in $R_0$ and end point in $R_1$. The minimal action is
now defined by

$$ U(p) = \inf_{\mathcal{A}_{R_0,p}} E(C) = \inf_{p_0 \in R_0} \; \inf_{\mathcal{A}_{p_0,p}} E(C) \tag{8} $$

where $\mathcal{A}_{R_0,p}$ is the set of all paths starting at a point of
$R_0$ and ending at $p$. This minimal action can be computed the same way as
before in Table 1, with the alive set initialized as the whole set of points of
$R_0$, with $U = 0$, and trial points being the set of 4-connexity neighbors of
points of $R_0$ that are not in $R_0$. Backpropagation by gradient descent on
$U$ from any point $p$ in the image gives the minimal path that joins this
point with region $R_0$.

In order to find a minimal path between region $R_1$ and region $R_0$, we
determine a point $p_1 \in R_1$ such that $U(p_1) = \min_{p \in R_1} U(p)$. We
then backpropagate from $p_1$ to $R_0$ to find the minimal path between $p_1$
and $R_0$, which is also a minimal path between $R_1$ and $R_0$.

4.2 Minimal Paths from a Set of Connected Components 

Here, our approach is to compute the distance map to an unstructured set of
points where connected points are considered as regions, using a weighted
distance defined through the potential $P$. The set of paths is obtained with
the minimal action with respect to $P$ with zero value at all regions $R_k$.
This method has the same possibilities as the one presented in Section 3, as we
only need to compute one minimal action map in order to find all paths. In the
same way, we find the saddle points between each pair of propagating fronts
from two regions $R_1$ and $R_2$. We then compute the minimal paths between two
regions $R_1$ and $R_2$ by back-propagating from each selected saddle point
until we meet a pixel belonging respectively to regions $R_1$ and $R_2$. The
notion of linked neighbors is extended to linked regions. More details can be
found in [3].



4.3 Connecting All Regions with Minimal Paths 

The goal here is to connect all regions, but only to those that are closer to them 
in the energy sense. In Section 3, we were looking for closed contours from a 
set of points, and had constraints on the maximal number of linked neighbors. 
In the application we show below, we are interested in connecting all initial 
given points through a path. In this case we also need to have T-junctions, for 
reconstructing tree-like structures, therefore the algorithm can be used with a 
higher number of linked regions allowed for each region, as said for the end 
points in Subsection 3.3. The constraint we use now is to avoid creating a closed 
contour when connecting two regions. This is obtained through the definition of 
cycles and cycle tests. 

A cycle is a sequence of different regions $R_k$, $1 \le k \le K$, such that
for $1 \le k \le K - 1$, $R_k$ and $R_{k+1}$ are linked regions and $R_K$ and
$R_1$ are also linked regions. A cycle test can be easily implemented using a
recursive algorithm. When two regions $R_i$ and $R_j$ are about to be
connected, i.e. when their fronts meet, a table storing the connectivity
between each pair of regions makes it possible to detect whether a connection
already exists between those regions. Having $N$ different regions, we fill a
matrix $M(N, N)$ with zeros, and each time two regions $R_i$ and $R_j$ meet for
the first time, we set $M(i,j) = M(j,i) = 1$. Thus, when two regions meet, we
apply the algorithm detailed in Table 4.

If two regions are already connected, the pixel where their fronts meet is not 
considered as a valuable candidate for back-propagation. 

The algorithm stops automatically when all regions are connected. 

Once a saddle point between any regions $R_i$ and $R_j$ is found and selected,
back-propagation relative to the final energy $U$ should be done both ways, to
$p_i$ and to $p_j$, the first points belonging respectively to $R_i$ and $R_j$
that are met during the back-propagation algorithm.

Table 4. Cycle detection

Algorithm for cycle detection when a region $R_i$ meets a region $R_j$:

Test(i, j, M, i); with

Test(i, j, M, l):

- if $M(l,j) = 1$, return 1;

- else

    • count = 0;

    • for $k \in [1, N]$ with $k \ne i$, $k \ne j$, $k \ne l$:

        if $M(k,j) = 1$, count += Test(j, k, M, l);

    • return count;
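
The purpose of the cycle test is to refuse a link between two regions that are
already joined by a chain of links. The Python sketch below shows one way to
check this; it is a substitute for the recursive procedure of Table 4, not a
transcription of it: it uses an iterative depth-first search, 0-based indices,
and assumes that `M` only records links that were actually accepted. The names
`already_connected` and `try_link` are hypothetical.

```python
def already_connected(M, i, j):
    """Return True if regions i and j are joined by a chain of links in M."""
    n = len(M)
    seen = {i}
    stack = [i]
    while stack:
        k = stack.pop()
        if k == j:
            return True
        for l in range(n):
            if M[k][l] and l not in seen:
                seen.add(l)
                stack.append(l)
    return False


def try_link(M, i, j):
    """Link regions i and j unless this would close a cycle.

    Returns True when the link is accepted (and recorded in M).
    """
    if already_connected(M, i, j):
        return False
    M[i][j] = M[j][i] = 1
    return True
```

With N regions, at most N - 1 links are accepted, after which every further
meeting is rejected and the set of paths forms a tree-like structure.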



4.4 Medical Application 

The method can be applied to connected components from a whole set of edge 
points or points obtained through a preprocessing. Finding all paths from a given 
set of points is interesting in the case of a binary potential defined, like in Figure 
3, for perceptual grouping. It can be used as well when a special preprocessing 
is possible, either on the image itself to extract characteristic points or on the 
geometry of the initial set of points to choose more relevant points. We show in 
Figure 8 an example of application for a hip medical image where we are looking 
for vessels. Potential P is defined using ideas from [9] on vesselness filter. 

For this, we propose to extract valuable information from this dataset by
computing a multiscale vessel enhancement measure, based on the work of [9] on
ridge filters. Having extracted the two eigenvalues of the Hessian matrix
computed at scale $\sigma$, ordered $|\lambda_1| \le |\lambda_2|$, we define at
each pixel a vesselness function

$$ \nu(\sigma) = \begin{cases} 0, & \text{if } \lambda_2 > 0 \\ \exp\left(-\frac{R_B^2}{2\beta^2}\right)\left(1 - \exp\left(-\frac{S^2}{2c^2}\right)\right), & \text{otherwise} \end{cases} $$

where $R_B = |\lambda_1| / |\lambda_2|$ and $S = \sqrt{\lambda_1^2 + \lambda_2^2}$.
See [9] for a detailed explanation of the settings of each parameter in this
measure. Using this information computed at several scales, we can take as new
image the maximum of the response of the filter across all scales. Figure 8
(top right) shows the response of the filter, based on the Hessian information.
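
As an illustration of the measure, the following Python sketch evaluates a
vesselness of this form at a single pixel from the entries of the 2x2 Hessian.
Everything named here is an assumption: the function names, the argument
layout, and in particular the default values of `beta` and `c`, which are
placeholders rather than the settings used for Figure 8 (see [9] for how these
parameters are chosen).

```python
import math


def hessian_eigenvalues(hxx, hxy, hyy):
    """Eigenvalues of the symmetric Hessian, ordered so that |l1| <= |l2|."""
    mean = 0.5 * (hxx + hyy)
    radius = math.hypot(0.5 * (hxx - hyy), hxy)
    return sorted((mean - radius, mean + radius), key=abs)


def vesselness(hxx, hxy, hyy, beta=0.5, c=15.0):
    """Vesselness at one pixel and one scale (bright line on dark background)."""
    lam1, lam2 = hessian_eigenvalues(hxx, hxy, hyy)
    if lam2 >= 0:
        # lam2 > 0 is the zero case of the formula; lam2 == 0 means a flat
        # Hessian and is also mapped to 0 here to avoid dividing by zero.
        return 0.0
    rb = abs(lam1) / abs(lam2)    # blob-versus-line ratio R_B
    s = math.hypot(lam1, lam2)    # second-order structure strength S
    return (math.exp(-rb * rb / (2.0 * beta * beta))
            * (1.0 - math.exp(-s * s / (2.0 * c * c))))


def multiscale_vesselness(hessians):
    """Maximum response over scales; `hessians` holds one (hxx, hxy, hyy) per scale."""
    return max(vesselness(hxx, hxy, hyy) for hxx, hxy, hyy in hessians)
```

A line-like pixel (one strongly negative eigenvalue, the other near zero)
scores higher than a blob-like pixel with two equal negative eigenvalues.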

A very strict threshold can easily be applied to this image, leading to sets of
unstructured pixels that surely belong to the anatomical object of interest, as
shown in Figure 8 (bottom left). Figure 8 (bottom right) shows the set of
completion paths obtained that link all given connected components together.

Fig. 8. Medical Image. First line: original image and vesselness potential; Second line:
from the set of regions obtained from thresholding of potential image, our method finds
links between these regions as minimal paths with respect to the potential



5 Conclusion 

We presented a new method that finds a set of contour curves in an image. It 
was applied to perceptual grouping to get complete curves from a set of noisy 
contours or edge points with gaps. The technique is based on previous work of 
finding minimal paths between two end points [6]. However, in our approach, we 
do not need to give the start and end points as initialization. In a first method, 
given a set of key points, we found the pairs of key points that had to be linked by 
minimal paths. Once saddle points between pairs of points are found, paths are 
drawn on the image from the selected saddle points to both points of each pair. 
This gives the minimal paths between selected pairs of points. The whole set of 
paths completes the initial set of contours and allows closing these contours. In a
second method, we compute the whole set of paths between unstructured regions 
in the image. We illustrate this approach with a medical image application. 
Abstract. Algorithms for tracking generic 2D object boundaries in a
video sequence should not make strong assumptions about the shapes to 
be tracked. When only a weak prior is at hand, the tracker performance 
becomes heavily dependent on its ability to detect image features; to 
classify them as informative (i.e., belonging to the object boundary) or 
as outliers; and to match the informative features with corresponding 
model points. Unlike simpler approaches often adopted in tracking prob- 
lems, this work looks at feature classification and matching as two unsu- 
pervised learning problems. Consequently, object tracking is converted 
into a problem of dynamic clustering of data, which is solved using com- 
petitive learning algorithms. It is shown that competitive learning is a 
key technique for obtaining accurate local motion estimates (avoiding 
aperture problems) and for discarding the outliers. In fact, the competi- 
tive learning approach shows several benefits: (i) a gradual propagation 
of shape information across the model; (ii) the use of shape and noise 
models competing for explaining the data; and (iii) the possibility of 
adopting high dimensional feature spaces containing relevant informa- 
tion extracted from the image. This work extends the unified framework 
proposed by the authors in [1]. 



1 Introduction 

Object tracking has attracted the attention of researchers over the last three 
decades. Several approaches have been proposed to address this problem. One 
popular approach consists of using a continuous description of the object bound- 
ary (e.g., using a parametric curve) attracted by discrete features detected in the 
image (e.g., edge points) [2,7]. Two problems make this approach difficult. First, 
it is not easy to associate image points to model points. This matching prob- 
lem is usually addressed by recursive suboptimal techniques [4]. Second, only a 
subset of the detected features is associated to the boundary of the object to be 
tracked. Many of the detected features are associated to inner edges or to the 
background edges, and should therefore be considered as outliers. The outlier 
classification is not easy. This problem is similar to the false alarm problem in
radar processing [3], although there are some differences: in tracking, the per- 
centage of outliers is usually higher and they are not uniformly distributed in 
space. In fact, they tend to cluster along strokes. 

One way to circumvent the matching problem is by using model driven fea- 
ture detection methods. In this case, the model boundary is sampled and an 
image feature is searched in the vicinity of each model sample (e.g., using a uni- 
directional search along a direction orthogonal to the object boundary) [16,6]. 
This approach avoids the matching problem but it makes feature detection de- 
pendent on the initial estimate of the model boundary, obtained by prediction. 
The tracker will perform well while the object motion and shape are highly
predictable, but it fails in the presence of rapid motion or shape changes [13].

The presence of invalid features (outliers) is also a major source of difficulties. 
Most trackers use ad hoc techniques to separate good features from the outliers 
e.g., by discarding those features which are far from the object boundary [8]. 
Again, since we do not know the exact location of the object boundary and rely 
on estimates only, the performance of such techniques becomes dependent on the 
initial model obtained by prediction and fails in the case of inaccurate prediction 
results. 

This paper addresses data matching and outlier segmentation using unsuper- 
vised learning techniques. A class of methods is presented which extends several 
well known algorithms (Kohonen maps, elastic nets, snakes, fuzzy c-means) and 
allows the design of new methods. All these methods represent the image fea- 
tures by a small number of centroids which are easily associated to model points. 
Matching is therefore achieved in this way while the outlier segmentation is per- 
formed during the unsupervised learning process. 

This paper extends the work presented in [1] and is organized as follows: 
section 2 formulates the problem; section 3 describes a unified framework for 
shape estimation, which adopts centroid-based approaches for data matching; 
outlier rejection is discussed in section 4; experimental results are presented in 
sections 3 and 4; section 5 concludes the paper with a discussion of the proposed 
techniques. 

2 Problem Formulation 

Let $x(s) = (x_1(s), x_2(s))$ denote a 2D curve belonging to the admissible
shape space

$$ A = \{ x(s) : x(s) = H(s)\,\theta \} \tag{1} $$

where $H(s)$ is a known $2 \times N$ matrix with shape basis functions and
$\theta$ is an $N \times 1$ vector of coefficients. Since the curve $x$ depends
on $\theta$, it will be denoted as $x(s, \theta)$. Several sets of basis
functions can be used to define the admissible shape space, such as splines,
sine or sinusoidal functions [5]. Furthermore, this model also allows
representing the object boundary as a geometric transformation (e.g., affine
transformation) of a reference shape [6].
In dynamic problems, it will be assumed that the object shape describes a
trajectory in the admissible shape space, $x(s, \theta^t)$. The evolution of
$\theta^t$ is described by a dynamic model, e.g., a stochastic difference
equation. The estimation of the model parameters from an image sequence is
achieved by solving an optimization problem

$$ \hat{\theta}^t = \arg\min_{\theta} E(\theta, Y^t) \tag{2} $$

where

$$ E(\theta, Y^t) = E_r(\theta) + E_d(\theta, Y^t) \tag{3} $$

is an energy function with a data dependent term and a regularization term,
and $Y^t = \{y_1, \ldots, y_P\}$ denotes the visual features detected at time
$t$, e.g., the 2D coordinates of boundary points detected in the image.

The energy function is the sum of two competing terms: a regularization term
and a data dependent term. The regularization term is low as long as the shape
trajectory is well described by the dynamic model (i.e., while the object shape
is close to the predicted shape) and increases as the shape evolution deviates
from the predicted shape. The role of the data dependent term, $E_d$, is to
evaluate the ability of the model to represent the observed data $Y^t$.

The optimization problem defined above can also be interpreted as a MAP
estimation of the object shape, assuming a Gibbs model for the joint density
$p(\theta, Y)$ [15].



2.1 Regularization Energy 

It is often assumed that the object shape in a new frame $t$ is close to a
predicted shape, $\bar{x}$, with an isotropic and Gaussian uncertainty. In this
case, the $L_2$ norm is used to define the regularization energy

$$ E_r = \alpha_r \int \| x(s) - \bar{x}(s) \|^2 \, ds \tag{4} $$

where $\alpha_r$ is a constant which controls the uncertainty level of the
dynamic model. Using (1) and (4), and writing $\bar{x}(s) = H(s)\bar{\theta}$
for the predicted shape, it can be shown that

$$ E_r = \alpha_r (\theta - \bar{\theta})^T U (\theta - \bar{\theta}) \tag{5} $$

where

$$ U = \int H(s)^T H(s) \, ds \tag{6} $$

is usually denoted as the metric matrix of the admissible shape space $A$.
Fig. 1. Shape estimation problem: (a) known data matching; and (b) unknown data
matching (predicted model shape is represented by the dashed line; model samples are
represented by the circles and data features by cross marks).



2.2 Data Energy 

Denote $Y^t$ as the set of edge points detected in the $t$-th image. Ideally,
we would like to detect image points belonging to the object boundary only.
However, every practical edge detector also yields edge points produced by
other objects present in the scene as well as inner edges produced by the
object to be tracked. These points will be considered as outliers.

The data energy should measure the match between the image features and
the object model (see Fig. 1b)

$$ E_d = d(Y^t, x(\theta)) \tag{7} $$

where $d$ is a distortion measure between both sets of points.

This raises two difficulties: first we should be able to discard the influence of 
outliers; second we should be able to define a 1-1 match between image features 
and model points. Both operations are difficult. We will address each of these 
problems in the next two sections. 

3 Shape Estimation 

This section addresses shape estimation from a set of edge points detected in 
the image. For the sake of simplicity, it will be assumed first that all the points 
belong to the boundary of the object to be tracked. Furthermore, the object 
boundary is assumed to be represented by a set of samples $x(s_i), i = 1, \ldots, M$.

Two cases will be considered below (see fig. 1). First we will assume that 
there is a known 1-1 matching between the model points and the image features, 
i.e., each edge point detected in the image is matched to a known contour sam- 
ple. This situation occurs when feature extraction is model driven, e.g., when 
the image features are detected by unidimensional search procedures along direc- 
tions normal to the object contour. Second, we will discuss the case of unknown 
matching. This situation occurs in image driven feature detection methods in 
which the number of image features is much higher than the number of model 
samples. 
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3.1 Known Matching 

Let $x(s_i), i = 1, \dots, M$ be the samples of the predicted contour and let
$y_i \in Y^t, i = 1, \dots, M$ be the corresponding image features (see fig. 1a).
Several metrics have been used to measure the difference between both sets of
points, a popular solution being a weighted sum of the squared errors

$$E_d = \frac{1}{2} \sum_{i=1}^{M} w_i \, \| y_i - x(s_i, \theta) \|^2 \tag{8}$$

Parameters $w_i$ assign a confidence degree to each observed feature. The
estimation of the unknown parameter vector, $\theta$, is obtained by minimizing (8).
Since the gradient of (8) is

$$\frac{\partial E_d}{\partial \theta} = -\sum_i w_i \, H(s_i)^T \big( y_i - H(s_i)\theta \big) \tag{9}$$

the optimal solution is

$$\hat{\theta}_d = S^{-1} Z \tag{10}$$

with

$$S = \sum_i w_i \, H(s_i)^T H(s_i) \tag{11}$$

and

$$Z = \sum_i w_i \, H(s_i)^T y_i \tag{12}$$

A minimum of N/2 features is needed to obtain a solution for the shape
estimation problem (nonsingular matrix S). The minimization of the weighted least
squares criterion (8) is equivalent to a maximum likelihood estimation of the
unknown parameters, assuming that the image features are independent random
variables with Gaussian distribution and omnidirectional uncertainty.
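
The weighted least squares solution of the known-matching case can be sketched in a few lines. The snippet below is only an illustration under stated assumptions: it assumes a shape model that is linear in the parameters, $x(s_i, \theta) = H(s_i)\theta$, and uses a six-parameter affine model as a stand-in for $H(s_i)$. The names `affine_design` and `weighted_ls_shape` are hypothetical and do not come from the paper.

```python
import numpy as np

def affine_design(p):
    """2x6 matrix H(s_i) of an affine shape model (an assumed example model)."""
    x0, y0 = p
    return np.array([[x0, y0, 1.0, 0.0, 0.0, 0.0],
                     [0.0, 0.0, 0.0, x0, y0, 1.0]])

def weighted_ls_shape(ref_pts, feats, weights):
    """Accumulate S and Z as weighted sums over the matched pairs, then solve S theta = Z."""
    n_params = 6
    S = np.zeros((n_params, n_params))
    Z = np.zeros(n_params)
    for p, y, w in zip(ref_pts, feats, weights):
        H = affine_design(p)
        S += w * H.T @ H      # contribution of one sample to S
        Z += w * H.T @ y      # contribution of one sample to Z
    return np.linalg.solve(S, Z)

# Usage: four reference samples whose features come from a known affine map.
ref_pts = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
theta_true = np.array([1.1, 0.2, 0.5, -0.1, 0.9, -0.3])
feats = np.array([affine_design(p) @ theta_true for p in ref_pts])
theta_hat = weighted_ls_shape(ref_pts, feats, np.ones(len(ref_pts)))
```

With N = 6 parameters, three non-collinear matched features (N/2) already make S nonsingular in this toy setting; additional features only improve conditioning.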

This model has two drawbacks: (i) it is not robust with respect to the presence
of outliers (this problem will be addressed later); (ii) it cannot be used in the
case of unmatched data. The use of the regularization energy alleviates these
drawbacks when the number of outliers is small and most of the observations
are correctly matched with model points.

The solution of the optimization problem (3) with regularization is given by 

9 = {arU + Sy^ {arH9 + Z) (13) 

where matrices S and Z are defined in (11) and (12), respectively.

3.2 Unknown Matching

Let us now address the problem of shape estimation from a set of unmatched
data points (see fig. 1b). This problem has been addressed by several authors
[11,7,4]. We will focus on an unsupervised learning strategy, developed by the
authors, denoted as Unified Framework [1]. This framework extends several well
known algorithms such as snakes [11], Kohonen maps [12], elastic nets [10] and
fuzzy c-means [9], and allows the development of new ones.

Let $x(s_i), i = 1, \dots, M$ and $y_j, j = 1, \dots, P$ be the unmatched contour
samples and image features, respectively. Since matching is unknown we will assume
that each data feature is associated with all model samples with different
confidence degrees (see fig. 2). The data energy adopted in this framework is given
by (compare with (8))

$$E_d = \frac{1}{2} \sum_{i=1}^{M} \sum_{j=1}^{P} w_{ij} \, \| y_j - x(s_i, \theta) \|^2 \tag{14}$$

where $w_{ij}$ is the confidence degree associated to the tentative match between
$x(s_i)$ and $y_j$. This expression includes, as special cases, the objective functions
minimized by snakes, Kohonen maps, elastic nets and fuzzy c-means, i.e., all
these methods belong to the unified framework (see details in [1]). The weights
$w_{ij}$ associated with these methods are defined in table 1. Other choices for $w_{ij}$
are also possible, allowing the development of other shape estimation algorithms.
The optimization of energy (14) can be achieved by evaluating its gradient,

$$\frac{\partial E_d}{\partial \theta} = -\sum_i m_i \, H(s_i)^T \big( c_i - x(s_i, \theta) \big) \tag{15}$$

where $c_i$ and $m_i$ are the centroid and the mass of the data associated with the
i-th model sample, defined by

$$c_i = \frac{1}{m_i} \sum_j \vartheta_{ij} \, y_j, \qquad m_i = \sum_j \vartheta_{ij} \tag{16}$$

The weight $\vartheta_{ij}$ represents the influence of the j-th feature on the i-th model
sample. The influence functions for each of the previous methods are defined in
table 1. They are related with the confidence degrees by

$$\vartheta_{ij} = w_{ij} + \sum_k \frac{\partial w_{kj}}{\partial d_{ij}} \, d_{kj} \tag{17}$$

where $d_{ij} = \| y_j - x(s_i) \|^2$.

Figure 3 shows the influence functions and corresponding attraction regions 
(level set regions) in the case of a contour with a few samples. While the snake 
algorithm has omnidirectional attraction regions, the attraction regions of the 
other algorithms are no longer omnidirectional since they are influenced by the 







Arnaldo J. Abrantes and Jorge S. Marques 




Fig. 2. Tentative matching between data points and model samples (only a few samples 
are shown). 

Table 1. Weighting functions of snakes, Kohonen maps, elastic nets, fuzzy c-means and
crisp c-means. Note that $d_{ij} = \|x(s_i) - y_j\|^2$, $\varphi_\sigma(d_{ij}) = \exp(-d_{ij}/2\sigma^2)$,
$i^*(j)$ denotes the index of the model sample nearest to $y_j$, $q$ controls the degree of
fuzziness, $\delta$ is the Kronecker function and $\Lambda_t$ is the neighborhood function used in
Kohonen maps.





Wij 


dij 


Snakes 


dij 


(pa {dij ) 


Kohonen maps 




Ar(i,i*{j)) 


Elastic nets 


- 2 < t ^ log Efc 


4>cr ) 


M dij 




Fuzzy C-means 


(Edy*)" 


(sdfc)*)" 


C-means 







location of the neighboring contour samples. This is due to the presence of
competitive learning mechanisms: the model samples compete to represent the
data.

Comparing (15) with (9) it can be concluded that the centroids $c_i$ play the role
of the observations, $y_i$, and the masses $m_i$ play the role of the confidence
weights $w_i$ of (8) in the known matching case. Therefore, the solution for the
minimization problem (14) is still given by (10), with

$$S = \sum_i m_i \, H(s_i)^T H(s_i) \tag{18}$$

and

$$Z = \sum_i m_i \, H(s_i)^T c_i \tag{19}$$

Since the influence weights depend on the initial contour estimate, an iterative
process is usually adopted to estimate the object boundary. The algorithm is
summarized in table 2.

Table 2. Pseudo-code for the centroid-based tracking algorithm.

Predict model shape
Repeat
    Compute M samples of the model shape.
    Evaluate centroids and masses using (16).
    Estimate model parameters using (18, 19) in (13) (or in (10)).
Until a pre-defined number of iterations is performed
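
The loop of table 2 can be sketched as follows. This is a minimal illustration under several assumptions that are not taken from the paper: the shape model is a pure translation of a reference contour (so that $H(s_i)$ is the identity), no regularization is used, and the influence weights are the standard fuzzy c-means memberships raised to the power q. The function names are hypothetical.

```python
import numpy as np

def fcm_influence(d, q=2.0, eps=1e-12):
    """Standard fuzzy c-means memberships raised to the power q.

    d[i, j] is the squared distance between model sample i and feature j.
    """
    inv = (d + eps) ** (-1.0 / (q - 1.0))
    u = inv / inv.sum(axis=0, keepdims=True)   # memberships of each feature sum to one
    return u ** q

def centroids_and_masses(samples, feats, q=2.0):
    """Centroid and mass of the data attracted by each model sample."""
    d = ((samples[:, None, :] - feats[None, :, :]) ** 2).sum(axis=2)
    infl = fcm_influence(d, q)
    mass = infl.sum(axis=1)
    cent = (infl @ feats) / mass[:, None]
    return cent, mass

def track_translation(ref_samples, feats, n_iter=20, q=2.0):
    """Iterative estimation for the model x(s_i) = ref_i + theta."""
    theta = np.zeros(2)                        # predicted shape: no displacement
    for _ in range(n_iter):
        samples = ref_samples + theta          # compute M samples of the model shape
        cent, mass = centroids_and_masses(samples, feats, q)
        # With H(s_i) = I: S = sum(m_i) I and Z = sum(m_i (c_i - ref_i)).
        theta = (mass[:, None] * (cent - ref_samples)).sum(axis=0) / mass.sum()
    return theta

# Usage: 8 samples on a unit circle, 80 edge points on a shifted copy of it.
M, P = 8, 80
a_model = 2 * np.pi * np.arange(M) / M
a_data = 2 * np.pi * (np.arange(P) + 0.5) / P
ref_samples = np.stack([np.cos(a_model), np.sin(a_model)], axis=1)
feats = np.stack([np.cos(a_data), np.sin(a_data)], axis=1) + np.array([0.2, 0.0])
theta = track_translation(ref_samples, feats)
```

How closely and how fast `theta` approaches the true displacement depends on q, on the sampling densities and on the size of the shift; the sketch only shows the structure of the iteration.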









Fig. 3. Influence functions and attraction regions of (a) snakes, (b) elastic nets, (c) 
c-means, (d) fuzzy c-means and (e) Kohonen maps. 



The next two examples illustrate the role played by the competitive learning 
mechanisms in shape estimation. It is emphasized that the use of competitive 
learning improves the solution of the matching problem.



Fig. 4. Tracking example: (a) initial contour (dashed line) and data points (solid line); 
(b) nearest neighbor matching; and (c, d) centroid based matching (fuzzy c-means) 
after (c) 6 and (d) 200 iterations (the updated shapes are not shown). 



Example 1 

Figure 4a shows the example of an object undergoing a uniform motion. The 
samples of the predicted shape are represented by small circles and the observed 
data by a solid line representing a large number of edge points. The motion vec- 
tor is represented by an arrow. Figure 4b shows a tentative matching between 
the model points and observed data using the nearest neighbor criterion, used 
for example in the ICP algorithm [4]. The displacement vectors associated with 
each match are not consistent, therefore leading to erroneous alignment results. 
Figures 4c, d show the centroid locations obtained with the fuzzy c-means after 
6 and 200 iterations. Although the shape estimate was updated during the opti- 
mization process, only the initial position of the model is shown for the sake of 
simplicity. It is stressed that consistent displacement vectors are now obtained 
(see fig. 4d) due to the presence of the competitive learning mechanisms which 
avoid having the same data represented by different model samples. 



Example 2 

Figure 5 compares the ability of 6 different data association methods to recover
(or not) from a large prediction error. In this figure the initial model
configuration (predicted shape) is represented by a dashed line, the edge data is
represented by a solid line and the association process is visualized by the
segments joining model samples with data centroids (small circles). It is observed
that the methods without competitive learning (snakes and nearest neighbor; see
figs. 5a, f) fail to match the data points with the correct model samples. In fact,
different model points are attracted toward the same data features and the shape
of the model configuration (not shown in the figure) will collapse after a few
iterations. Data association methods with competitive learning show clear
improvements. All these methods manage to attract the upper and lower boundary
of the model towards centroids located in the correct directions (see figs. 5b-e).
As the number of iterations increases the matching accuracy gradually improves
(see fig. 6 for the case of fuzzy c-means).

Fig. 5. Data matching after the first iteration for (a) snakes, (b) elastic nets, (c)
c-means, (d) fuzzy c-means, (e) Kohonen maps and (f) nearest neighbor.

Fig. 6. Data matching after (a) first; (b) second; and (c) third iteration (fuzzy c-means).

Fig. 7. Shape estimation failure in the presence of a large number of outliers: (a)
detected edge points; (b) initial contour; (c) final contour obtained with Kohonen maps.



4 Robust Data Association 

Previous methods work well when the boundary edges are segmented from back- 
ground and inner edges (outliers). However, these methods fail when such edges 
are not removed (see fig. 7). The problem is even more severe in the case of 
competitive learning methods with unbounded attraction regions. 

Two methods will be used in the sequel to overcome these difficulties. The
first consists of segmenting the data as valid or invalid. This will be performed by
a soft decision process involving the concept of a noise plane. The second approach
consists of using extended features, i.e., feature vectors with dimension higher
than two which contain the edge coordinates as before and additional properties
of the image in the vicinity of the edge point (e.g., color, gradient direction).
A feature will attract a model point if they are close in the high dimensional
space, i.e., if they are close in the image plane and, furthermore, if the other
properties match. This provides an additional criterion for discarding
the influence of the outliers. The use of extended features was advocated by a
number of authors [17,14]. Both methods can be implemented in the scope of the
unified framework, as shown below, by introducing minor changes in the energy
function (14).



4.1 Noise Plane 

The detected edges which are far from the predicted contour should have a low
confidence degree since the probability of belonging to the object boundary is
low. Methods with unbounded attraction regions cannot cope with this situation:
far away edges will have a strong influence on the final shape estimate. To
overcome this difficulty a virtual model sample, $x(s_{M+1})$, will be defined in such
a way that $d_{M+1,j} = \eta$, i.e., the distance of this additional sample to all image
features is constant. The virtual sample $x(s_{M+1})$ is not a 2D point. Instead, it
can be interpreted as a plane parallel to the image plane, which competes with
the model samples to represent the data (see fig. 8). Points which are close to the
object model will be represented by ordinary model samples while points far from
the predicted shape will be best approximated by the noise plane. Therefore, the
data energy becomes

$$E_d = \frac{1}{2} \sum_{i=1}^{M+1} \sum_{j=1}^{P} w_{ij} \, d_{ij} \tag{20}$$

where the confidence degrees $w_{ij}$ are still given by the expressions in table 1.

Fig. 8. Noise model represented by a plane parallel to the image plane.

Fig. 9. Effect of the noise plane on the attraction regions of fuzzy c-means for different
values of $\eta$: (a) large, (b) medium and (c) small.

Figure 9 shows the effect of the noise plane on the attraction regions of fuzzy
c-means for three values of $\eta$. It is observed that the practical influence of the
noise plane is to bound the attraction regions while it preserves their shape in
the vicinity of the predicted contour.
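
The bounding effect of the noise plane can be illustrated with a small numerical sketch. It assumes the standard fuzzy c-means membership formula and treats the noise plane as an extra row of constant squared distance `eta`; the function name and the toy coordinates are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def memberships_with_noise_plane(samples, feats, eta, q=2.0, eps=1e-12):
    """Fuzzy c-means memberships with an extra (M+1)-th 'noise plane' row."""
    d = ((samples[:, None, :] - feats[None, :, :]) ** 2).sum(axis=2)
    d_noise = np.full((1, feats.shape[0]), float(eta))   # d_{M+1,j} = eta for all j
    d_all = np.vstack([d, d_noise])
    inv = (d_all + eps) ** (-1.0 / (q - 1.0))
    return inv / inv.sum(axis=0, keepdims=True)

samples = np.array([[0.0, 0.0], [1.0, 0.0]])
feats = np.array([[0.1, 0.0],      # close to the first model sample
                  [10.0, 10.0]])   # far from the contour
u = memberships_with_noise_plane(samples, feats, eta=0.25)
noise_share = u[-1]                # share of each feature captured by the noise plane
```

In this toy case the nearby feature is represented almost entirely by the first model sample, while the distant feature is absorbed almost entirely by the noise plane, so it contributes very little to the centroids of the ordinary samples.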

Figure 10 shows the oval shape estimation problem, previously discussed
in this section (see figs. 5 and 6), but now with an additional difficulty: the
presence of outliers. This problem is not easy since the predicted shape is far
from the true object boundary and it is closer to outliers detected in the image.
Furthermore, we cannot benefit from imposing a rigid model shape since we
allow the predicted shape to deform. Despite these difficulties the fuzzy c-means
algorithm with noise plane manages to estimate the object boundary well and
to discard the incorrect data. These results were obtained in two stages, each
one with a different value for $\eta$: a large value during the first 20 iterations and
a much smaller one during the last 20. The first stage (with large $\eta$) allows a
global convergence of the model towards the data, while the second stage (with
small $\eta$) refines the solution.

Fig. 10. [...] 10 and (c) 30 iterations.



4.2 Extended Features 

The noise plane efficiently discards edges which are far from the predicted shape 
but it is useless for discarding close edge points. An additional technique must 
be devised for dealing with such cases. 

Until this point only edge information was used to estimate the object
boundary. The unified framework is easily extended to feature spaces of arbitrary
dimension which may include color, texture and gradient information as well. The
energy function in such case is given by

$$E_d = \frac{1}{2} \sum_i \sum_j w_{ij} \, \| \tilde{y}_j - \tilde{x}(s_i, \theta) \|^2 \tag{21}$$

where $\tilde{x}(s_i, \theta) \in \mathbb{R}^{2+K}$ and $\tilde{y}_j \in \mathbb{R}^{2+K}$, K being the number of additional
features associated with each model or image point, respectively. To be more
specific, $\tilde{x}(s_i) = [x(s_i)^T \; t_i^1 \dots t_i^K]^T$ and $\tilde{y}_j = [y_j^T \; f_j^1 \dots f_j^K]^T$, where
$t_i^1, \dots, t_i^K$ denote the K feature values associated with the i-th model point,
$x(s_i, \theta)$, and $f_j^1, \dots, f_j^K$ the corresponding feature values, obtained at $y_j$ using
a set of adequate image feature extractors.

Since the gradient of the new energy (21) is similar to (15), the centroids
and masses being computed by (16) as before, parameter estimation is still
performed by the algorithm described in section 3.2 (see table 2). The weights
$\vartheta_{ij}$ are defined in table 1 with $d_{ij}$ replaced by the extended-space distance
$\tilde{d}_{ij} = \|\tilde{y}_j - \tilde{x}(s_i, \theta)\|^2$. We note that the centroids are still 2D vectors.
The additional features only influence the weighting functions, e.g., if the
properties associated with an image feature and a model sample are different,
then the weighting function $\vartheta_{ij}$ will be negligible, even if they are close in
the image plane. For example, the RGB components or the gradient direction in the
vicinity of an edge point can be used as extended features. In that case, the model
samples $x(s_i, \theta)$ will only be attracted by image features with similar color and
gradient direction. This makes it possible to discriminate among features
belonging to different objects present in the scene, therefore
increasing the ability of the model to discard outliers.
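
The selectivity gained with extended features can be sketched numerically. The example below assumes the standard fuzzy c-means membership formula, unit scaling between image coordinates and the additional features, and gradient direction encoded as (cos, sin); these choices and the function names are assumptions made for illustration only.

```python
import numpy as np

def extended_memberships(x, t, y, f, q=2.0, eps=1e-12):
    """Fuzzy c-means memberships using distances in the extended space.

    x: (M, 2) model samples, t: (M, K) model feature values,
    y: (P, 2) edge points,   f: (P, K) image feature values.
    """
    x_ext = np.hstack([x, t])
    y_ext = np.hstack([y, f])
    d = ((x_ext[:, None, :] - y_ext[None, :, :]) ** 2).sum(axis=2)
    inv = (d + eps) ** (-1.0 / (q - 1.0))
    return inv / inv.sum(axis=0, keepdims=True)

def centroids_2d(u, y, q=2.0):
    """Centroids stay 2D: the extra features only enter through the weights."""
    w = u ** q
    return (w @ y) / w.sum(axis=1, keepdims=True)

x = np.array([[0.0, 0.0], [0.0, 1.0]])
t = np.array([[1.0, 0.0], [-1.0, 0.0]])   # (cos, sin) of the gradient direction
y = np.array([[0.0, 0.4]])                # edge point nearer to the first sample
f = np.array([[-1.0, 0.0]])               # but with the second sample's direction
u_ext = extended_memberships(x, t, y, f)
u_plain = extended_memberships(x, np.zeros((2, 0)), y, np.zeros((1, 0)))
```

Using coordinates alone, the edge point is mostly assigned to the first sample; once the gradient direction is appended, it is assigned mostly to the second sample, whose direction it shares. A relative scaling between coordinates and features would normally have to be chosen for real data.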

Figure 11 shows an example of grain boundary estimation in a ceramic
material using extended features. These images were obtained with a scanning
electron microscope (SEM images). This is an example in which the use of the noise
plane is not enough to succeed in estimating the grain boundaries. In this case,
the feature vector was extended with two additional components: $\cos(\phi)$, $\sin(\phi)$,
$\phi$ being the gradient direction, i.e., a unit vector orthogonal to the iso-intensity
curve at each edge point. It can be observed from this example that the use of
extended features makes it possible to discriminate the edge points belonging to
each grain from the edges of neighboring grains.

5 Conclusion 

This paper addresses the tracking of moving objects from a set of visual fea- 
tures detected in the image. Most tracking systems rely on three basic steps: 
shape prediction, feature detection and contour update. Several methods have 
been proposed for each of these operations. Two key difficulties concern the 
association of image features to model points and the presence of outliers i.e., 
detected features which do not belong to the object to be tracked and should be 
eliminated. 

To circumvent these difficulties many trackers adopt a model driven feature 
detection in which the predicted shape is used to guide the image analysis oper- 
ations. This approach solves the matching problem since a single image feature 
is obtained for each model sample. However, it is strongly dependent on the pre- 
diction results. Therefore, good tracking results can only be obtained when the 
object motion and shape evolve according to slowly varying trajectories. This 
approach is suitable in some practical applications but it cannot cope with the
shape variability which occurs in many other problems. 

In this work, a set of centroids is used to represent the data features de- 
tected in the image (e.g., edge points) using competitive learning techniques. 
The centroids are initialized by sampling the boundary of the predicted contour 
model. The initialization procedure creates a natural association among the cen- 
troids and model points, therefore circumventing the need of an explicit matching 
technique. It has been shown that competitive learning procedures used for the 
update of centroids play a key role in the improvement of the consistency of local 
matching between data features and model points. 

A second problem which has been addressed in this paper concerns the
robustness of the shape estimate in the presence of outliers. To alleviate this
difficulty two techniques were proposed. The first method consists of using a noise
model, denoted as noise plane, which reduces the scope of the attraction regions
associated with each centroid: far away features will be discarded since they
will not be able to attract the centroids. The second method consists of using
extended features. A set of features, such as color, texture or gradient
information, is associated with each model point. Only the image features with
similar properties will be able to attract the corresponding model points.
Therefore, the attraction mechanisms become much more selective, allowing the
elimination of outliers. This provides an efficient ground for performing data
fusion in the scope of shape estimation and tracking.

Fig. 11. Segmentation of microscope image of ceramic material: (a) contour seeds; (b)
extended features (edge points and gradient direction); (c, d) results obtained with (c)
edge points only and with (d) extended features.

Abstract. We address the theoretical problems of optical flow estimation
and image registration in a multi-scale framework in any dimension.
Much work has been done based on the minimization of a distance
between a first image and a second image after applying a deformation
or motion field. We discuss the classical multiscale approach and point
out the problem of validity of the motion constraint equation (MCE)
at lower resolutions. We introduce a new local rigidity hypothesis that
allows us to prove such validity. This allows us to derive sufficient
conditions for convergence of a new multi-scale and iterative motion
estimation/registration scheme towards a global minimum of the usual
nonlinear energy instead of a local minimum, as did all previous methods.
Although some of the sufficient conditions cannot always be fulfilled
because of the absence of the necessary a priori knowledge on the motion,
we use an implicit approach. We illustrate our method by showing
results on synthetic and real examples (Motion, Registration, Morphing),
including large deformation experiments.

Keywords: motion estimation, registration, optical flow, multi-scale, 
motion constraint equation, global minimization, stereo matching 



1 Introduction 



Registration and motion estimation are among the most challenging problems in
computer vision, with countless applications in various domains
[17,18,6,4,13,30]. These problems occur in many applications like medical image
analysis, recognition, visual servoing, stereoscopic vision, satellite imagery or
indexation. Hence they have constantly been addressed in the literature
throughout the development of image processing techniques. For example (Figure 1)
consider the problem of finding the motion in a two-dimensional image
sequence. We then look for a displacement $(h_1(x_1, x_2), h_2(x_1, x_2))$ that minimizes
an energy functional:

$$\int |I_1(x, y) - I_2(x + h_1(x, y), y + h_2(x, y))|^2 \, dx \, dy.$$



M.A.T. Figueiredo, J. Zerubia, A.K. Jain (Eds.): EMMCVPR 2001, LNCS 2134, pp. 592-607, 2001. 
(c) Springer-Verlag Berlin Heidelberg 2001

Fig. 1. Finding the motion in a two-dimensional image sequence



Next consider the problem of finding $(f_1(x_1, x_2), f_2(x_1, x_2))$, a rigid or non rigid
deformation between two images (Figure 2), minimizing an energy functional:

$$\int |I_1(x, y) - I_2(f_1(x, y), f_2(x, y))|^2 \, dx \, dy.$$



Although most papers deal only with motion estimation or matching depending
on the application in view, both problems can be formulated the same way
and be solved with the same algorithm. Thus the work we present can be applied
both to registration for a pair of images to match (stereo, medical or morphing)
or motion field / optical flow for a sequence of images. In this paper we will
focus our attention on these problems assuming grey level conservation between
both images to be matched. Let us denote by $I_1(x)$ and $I_2(x)$ respectively the
study and target images to be matched, where $x \in D = [-M, M]^d \subset \mathbb{R}^d$ and
$d \ge 1$. In the following $I_1$ and $I_2$ are supposed to belong to the space $C_0^1(D)$ of
continuously differentiable functions vanishing on the domain boundary $\partial D$. We
will then assume there exists a homeomorphism $f^*$ of $D$ which represents the
deformation such that:

$$I_1(x) = I_2 \circ f^*(x), \quad \forall x \in D.$$

In the context of optical flow estimation, let us denote by $h^*$ its associated motion
field defined by $h^* = f^* - Id$ on $D$. We thus have:

$$I_1(x) = I_2(x + h^*(x)). \tag{1}$$

$h^*$ is obviously a global minimum of the nonlinear functional

$$E_{NL}(h) = \frac{1}{2} \int_D |I_1(x) - I_2(x + h(x))|^2 \, dx. \tag{2}$$

We can deduce from (1) the well known Motion Constraint Equation (also called
Optical Flow Constraint):

$$I_1(x) - I_2(x) \simeq \langle \nabla I_2(x), h^*(x) \rangle, \quad \forall x \in D. \tag{3}$$

$E_{NL}$ is classically replaced in the literature by its quadratic version, substituting
the integrand with the squared difference between both left and right terms of
the MCE, yielding the classical energy for the optical flow problem:

$$E_L(h) = \frac{1}{2} \int_D |I_1(x) - I_2(x) - \langle \nabla I_2(x), h(x) \rangle|^2 \, dx.$$

Here $\nabla$ denotes the gradient operator. Since the work of Horn and Schunck [17],
MCE (3) has been widely used as a first order differential model in motion
estimation and registration algorithms. In order to overcome the too low
spatio-temporal sampling problem which causes numerical algorithms to converge to
the closest local minimum of the energy $E_{NL}$ instead of a global one,
Terzopoulos et al. [24,30] and Adelson and Bergen [8,29] proposed to consider it
at different scales. This led to the popular coarse-to-fine minimizing technique
[18,11,13,25,14]. It is based on the remark that MCE (3) is a first order
expansion which is generally no longer valid for the motion $h^*$ being sought. The
idea is then to consider images at a coarse resolution and to refine iteratively
the estimation process.
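
The loss of validity of the first order expansion for large motions can be checked numerically. The following one-dimensional sketch assumes a Gaussian profile for the target image and a constant motion, both chosen only for illustration; it evaluates the nonlinear energy (2) and the quadratic energy built from the MCE (3) by a Riemann sum. All function names are hypothetical.

```python
import numpy as np

def image2(x):
    """Smooth 1D target image I2 (an illustrative Gaussian profile)."""
    return np.exp(-x ** 2)

def image2_grad(x):
    """Analytic derivative of image2."""
    return -2.0 * x * np.exp(-x ** 2)

def energy_nl(i1, h, x):
    """Nonlinear energy (2), approximated by a Riemann sum."""
    dx = x[1] - x[0]
    return 0.5 * np.sum((i1 - image2(x + h)) ** 2) * dx

def energy_l(i1, h, x):
    """Quadratic energy: squared residual of the MCE (3)."""
    dx = x[1] - x[0]
    residual = i1 - image2(x) - image2_grad(x) * h
    return 0.5 * np.sum(residual ** 2) * dx

def linearization_ratio(shift, x):
    """E_L(h*) / E_L(0) for a constant motion h* = shift."""
    h_star = np.full_like(x, shift)
    i1 = image2(x + h_star)            # grey level conservation, equation (1)
    return energy_l(i1, h_star, x) / energy_l(i1, np.zeros_like(x), x)

x = np.linspace(-6.0, 6.0, 2001)
ratio_small = linearization_ratio(0.05, x)   # small motion
ratio_large = linearization_ratio(1.0, x)    # motion comparable to the bump width
```

The true motion always gives a zero nonlinear energy, whereas the quadratic energy at the true motion is only a small fraction of its value at h = 0 when the motion is small. For the large shift that fraction is far from zero, which is the situation the coarse-to-fine strategy is meant to address.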

Using a regularizing kernel $G_\sigma$ at scale $\sigma$, Terzopoulos et al. [24,30] and
Adelson and Bergen [8] were led to consider the following modified MCE:

$$G_\sigma * (I_1 - I_2)(x) \simeq \langle G_\sigma * \nabla I_2(x), h^*(x) \rangle \tag{4}$$

Remark.

One could also consider regularizing both left and right terms of the original
MCE, yielding the following alternative:

$$G_\sigma * (I_1 - I_2)(x) \simeq G_\sigma * (\langle \nabla I_2, h^* \rangle)(x)$$

At finest scales it can be shown that these two propositions are equivalent.

To our knowledge and despite the huge literature on these approaches, no
theoretical error analysis can be found when such approximations are made. Though
it has been reported from numerical experiments that the modified MCE does
not perform well at very coarse scales, thus betraying its progressive lack of
sharpness, many authors pointed out convergence properties of such algorithms
towards a dominant motion in the case of motion estimation [7,11,10,21,9,16],
or an acceptable deformation in the case of registration [13,25,26], even if the
initial motion is large. It is widely assumed that deformation fields have some
continuity or regularity properties, leading to the addition of some particular
regularizing terms to the quadratic functional [17,5,30,3,2]. Let us focus
on the modified MCE (4). We denote by $\hat{h}$ the value of $h$ that minimizes the
energy

$$\int |G_\sigma * (I_1 - I_2) - \langle G_\sigma * \nabla I_2, h \rangle|^2 \, dx. \tag{5}$$

This multiscale approach assumes that Eqn. (4) is "valid" at lower resolutions,
which ensures that $\hat{h}$ will be close to $h^*$.

Although it may come from the fact that flattened images are always "more
similar", to our knowledge and despite the huge literature, no theoretical
analysis can confirm this. Replacing $G_\sigma$ by a particular low pass filter $\Pi_\sigma$ (here
$\sigma > 0$ is proportional to the number of considered harmonics in the Fourier
decomposition), we will address the problem of finding a linear operator $P_\sigma^{I_1}$ such
that $P_\sigma^{I_1}(h^*)$ is close to $\Pi_\sigma(I_1 - I_2)$. The sharpness of this approximation will
decrease with respect to both the norm of $h^*$ and the resolution parameter $\sigma$. This
will lead us to introduce a new local rigidity hypothesis of deformations $f = Id + h$
with respect to image $I_1$. Hence such deformations allow us to find the operator
satisfying our validity constraint on the modified MCE.

Fig. 2. Finding a non rigid deformation between two images

Considering general linear parametric motion models for $h^*$, we give
sufficient conditions for asymptotic convergence of the sequence of combined motion
estimations towards $h^*$ together with the numerical convergence of the sequence
of deformed templates towards the target $I_2$. Roughly speaking, the shape of the
theorem will be the following:

Theorem: If

1. at each step the residual deformation is "locally rigid", and the associated
motion can be linearly decomposed onto an "acceptable" set of functions the
cardinal of which is not too large with respect to the scale,

2. the initial motion norm is not too large, and the systems conditionings do
not decrease "too rapidly" when iterating,

3. the estimated deformations $Id + h_i$ are invertible and "locally rigid".

Then the iterative scheme "converges" towards a global minimum of the energy
$E_{NL}$.

The outline of the paper is as follows. In Section 2 we introduce a new local
rigidity hypothesis and a low pass filter in order to derive a new MCE of the type
of equation (6). In Section 3 we design an iterative motion estimation/registration
scheme based on the MCE introduced in Section 2 and prove a convergence
theorem. In order to avoid the a priori motion representation problem, we adopt an
implicit approach in Section 4 and constrain each estimated deformation $Id + h_i$
to be at least invertible. We show numerical results for large deformation
problems in dimension 2. Section 5 gives a general conclusion to the paper.

2 Valid Modified MCE Upon a New Local Rigidity Hypothesis

Assuming a local rigidity hypothesis and adopting the Dirichlet low-pass
operator $\Pi_\sigma$, we will find a different right hand side featuring a "natural" and
unique linear operator $P_\sigma^{I_1}$ in the sense that:

$$\Pi_\sigma(I_1 - I_2) \simeq P_\sigma^{I_1}(h^*) \tag{6}$$

with remainder of the order of $\|h^*\|^2$ for some particular norm and vanishing as
the scale is coarser ($\sigma$ close to 0).

2.1 Local Rigidity Property 

In this paragraph we introduce our local rigidity property of deformations. 
Notations in this context are to be understood as follows: 

— D = [-M,M]^ in 

— I\^p, l 2 ,p, h, are functions from IR'^ to IR. 

— h{x), h*{x) are functions from IR!^ to IR'^ . 

— < > denotes the scalar product in 

— denotes the scalar product in 

For technical reasons we assume that $I_1$ and $I_2$ belong to $C_0^1(D)$, and
$I_1(x) = I_2(x + h^*(x))$, $x \in D$, $h^*(x) \in \mathbb{R}^d$.

Definition 1. $f \in Hom(D)$ is $\xi$-rigid for $I_1 \in C^1(D)$ iff:

$$Jac(f)^T \cdot \nabla I_1 = \det(Jac(f)) \, \nabla I_1, \tag{7}$$

where $Jac(f)$ denotes the Jacobian matrix of $f$ and $\det(A)$ the determinant of
matrix $A$, and $Hom(D)$ the space of continuously differentiable and invertible
functions from $D$ to $D$ (homeomorphisms).

All $\xi$-rigid deformations have the following properties (see [19] for the proofs).
Assume $f^*$ is $\xi$-rigid for $I_1 \in C_0^1(D)$ and $I_1 = I_2 \circ f^*$. Then,

1. equation (7) is always true if dimension d is 1;

2. for all $d \ge 1$,

(a) $\|\nabla I_1\|_{L^1} = \|\nabla I_2\|_{L^1}$, where $L^1$ denotes the space of integrable functions
over $D$;

(b) $\nabla I_1 \parallel \nabla I_2 \circ f^*$;

(c) relation $\sim$ defined by

$[I_1 \sim I_2] \iff [\exists f \ \xi\text{-rigid for } I_1 \text{ s.t. } I_1 = I_2 \circ f]$

is an equivalence relation on $C_0^1(D)$;

3. suppose d = 2: then,

(a) if $Jac(f^*)$ is symmetric, then (7) means that if $|\nabla I_1| \neq 0$,

— direction $\eta = \nabla I_1 / |\nabla I_1|$ is eigenvector ($\lambda = \det(Jac(f))$ is an eigenvalue);

— direction $\xi$ is "rigid" ($\lambda = 1$ is an eigenvalue);

This property can be seen as a non-sliding motion property. It is illustrated
in Figure 3, where we show a level set of $I_1$, and a motion $h = f - Id$ of a
$\xi$-rigid deformation $f$ for image $I_1$: $h$ can vary only along the direction
of $\nabla I_1$.

(b) $\kappa(I_1) = [Tr(Jac(f^*)) - \det(Jac(f^*))] \cdot \kappa(I_2) \circ f^*$, where $\kappa(I)(x)$ stands for
the curvature of the level line of $I$ passing through $x$ and $Tr(A)$ denotes
the trace of matrix $A$;

4. if d = 1 or 2, and

— $h^*$ is known at

• 1 point (d = 1),

• each isolated critical point of $I_1$ and at one interior point of each
connected constant set of $I_1$ (d = 2),

— $h = h^*$ at this (these) point(s), and $I_1 = I_2 \circ (Id + h)$ on $D$,

then for all $x \in D$ where $I_1$ is not locally constant we have $h(x) = h^*(x)$.

Fig. 3. An example of motion $h = f - Id$ of a $\xi$-rigid deformation $f$ for image $I_1$. We
show a level set of image $I_1$, and the fields $\nabla I_1$ and $h$ along its boundary. $h$ varies only
along the direction of $\nabla I_1$.

Remark.

It is an important issue to know whether such $h^*$ is unique. In case $d \in \{1, 2\}$,
property 4 leads to uniqueness if $h^*$ is known at some isolated points. Though
it is not proved in the general case, we will assume uniqueness hereafter for
simplicity.

As a consequence we can show that $\xi$-rigid deformations of images can be
transferred to test functions. Indeed, we have the following

Lemma 1. Suppose that

1. $I_1$ and $I_2 \in C_0^1(D)$ are such that: $I_1 = I_2 \circ f$

2. $f$ is $\xi$-rigid for $I_1$

3. $\phi \in C^\infty(D; \mathbb{R})$, and $\Phi \in C^\infty(D; \mathbb{R}^d)$ s.t. $div\,\Phi = \phi$, where $C^\infty(D; \mathbb{R})$
denotes the space of indefinitely differentiable functions from $D$ to $\mathbb{R}$.

Then, $\int_D (I_1 - I_2)\,\phi\,dx = \int_D \langle \nabla I_1, \Phi \circ f - \Phi \rangle\,dx$.

Proof. See [20].



2.2 The Dirichlet Operator 



One choice for the set of test functions in Lemma 1 is the Fourier basis, the
simplest projection onto which is the Dirichlet projection operator. Let
$D = [-M, M]^d$; $S_\sigma = \{k \in \mathbb{Z}^d, \forall i \in [1, d], |k_i| \le M\sigma\}$; $c_k(I)$ denotes the Fourier
coefficient of $I$ defined by:

$$c_k(I) = \frac{1}{(2M)^d} \int_D I(x) \, e^{-i\pi \langle k, x \rangle / M} \, dx.$$

Then the Dirichlet operator $\Pi_\sigma$ is the linear mapping associating to each function
$I \in C_0^1(D)$ the function $\Pi_\sigma(I) = G_\sigma * I$, where the convolution kernel $G_\sigma$ is
defined by its Fourier coefficients as follows:

$$c_k(G_\sigma) = \begin{cases} 1 & \text{if } k \in S_\sigma, \\ 0 & \text{elsewhere} \end{cases}$$
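
On a sampled periodic image, a discrete analogue of this projection keeps the Fourier coefficients whose harmonic index is bounded along every axis and sets the others to zero. The sketch below is an assumption-laden illustration: it works on a uniform grid with the FFT, and the parameter `cutoff` stands for the bound on $|k_i|$; the function name is hypothetical.

```python
import numpy as np

def dirichlet_lowpass(img, cutoff):
    """Keep Fourier coefficients with |k_i| <= cutoff on every axis, drop the rest."""
    coeffs = np.fft.fftn(img)
    mask = np.ones(img.shape, dtype=bool)
    for axis, n in enumerate(img.shape):
        k = np.fft.fftfreq(n, d=1.0 / n)       # integer harmonic indices along this axis
        keep = np.abs(k) <= cutoff
        shape = [1] * img.ndim
        shape[axis] = n
        mask &= keep.reshape(shape)            # broadcast the per-axis condition
    return np.real(np.fft.ifftn(coeffs * mask))

# Usage: a 1D signal made of a low harmonic (2) and a high harmonic (20).
x = np.arange(64) / 64.0
low = np.sin(2 * np.pi * 2 * x)
high = np.sin(2 * np.pi * 20 * x)
out = dirichlet_lowpass(low + high, cutoff=5)
```

Because both components are exact harmonics of the grid, the output coincides with the low harmonic up to rounding error. Reducing `cutoff` corresponds to a coarser scale.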



2.3 New MCE by Linearization for the Dirichlet Projection 



Now that we have introduced our rigidity property of deformations and the 
Dirichlet projection, let us choose the test functions of Lemma 1 in the Fourier 
basis. Defining {h*){x) through its Fourier coefficients: 



CkiP^^ih*)) 



ico(< Vh,h* >) if fc = 0 

^^(<vp,fc><M:>) if ^ g 5^/{0} 

0 if fc ^ S,, 



we obtain the 

Theorem 1. If $f^* = Id + h^*$ is ^-rigid for $I_1 = I_2 \circ f^* \in C_q(D)$, then we have:

This inequality is nothing but the sharpness of MCE (6):

$$\Pi_\sigma(I_1 - I_2)(x) \simeq P^\sigma_{I_1}(h^*)(x), \qquad (8)$$

at scale $\sigma$. It clearly expresses the fact that measuring the motion (e.g., perceiving
the optical flow) h* is not relevant outside of the support of $|\nabla I_1|$.

Proof. See [20].
3 Theoretical Iterative Scheme
and Convergence Theorem 

In section 2 we found a new MCE and showed that we can control the sharpness 
of it. In this section we will make a rather general assumption on the motion in 
the sense that it should belong to some linear parametric motion model without 
being more specific on the model basis functions. Though it is somewhat restric- 
tive to have motion fields in a finite dimensional functional space, this structural 
hypothesis will be a key to bounding the residual motion norm after registration 
in order to iterate the process. This makes it possible to consider a constraint 
on motion when there is a priori knowledge (like for rigid motion) or consider 
multi-scale decomposition of motion for an iterative scheme. 



3.1 Linear Parametric Motion Models and Least Square Estimation

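Let us assume the motion h* has to be in a finite dimensional space of
deformation generated by basis functions $\Psi(x) = (\psi_i(x))_{i=1..n}$. Thus h* can be
decomposed in the basis: $\exists\, \theta^* = (\theta^*_i)_{i=1..n}$ unique, such that:

$$h^*(x) = \langle \Psi(x), \theta^* \rangle = \sum_{i=1..n} \theta^*_i \psi_i(x), \quad \forall x \in \mathrm{Supp}(|\nabla I_1|).$$

MCE (6) viewed as a linear model writes:

$$\Pi_\sigma(I_1 - I_2) = \langle P^\sigma_{I_1}(\Psi), \theta^* \rangle.$$

Now set, for $\sigma$ s.t. the $P^\sigma_{I_1}(\psi_i)$ be mutually linearly independent in $L^2$:

$$M_\sigma = P^\sigma_{I_1}(\Psi) \otimes P^\sigma_{I_1}(\Psi), \qquad Y_\sigma = \Pi_\sigma(I_1 - I_2),$$

where $\otimes$ stands for the tensorial product in $L^2$. Then applying basic results from
the classical theory of linear models yields: $h = \langle \Psi, \theta \rangle = \langle \Psi, M_\sigma^{-1} B_\sigma \rangle$, where
column $B_\sigma$'s components are defined by $(B_\sigma)_i = \langle P^\sigma_{I_1}(\psi_i), Y_\sigma \rangle$.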
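The least square estimate above amounts to solving the normal equations of a linear model. The following is a minimal sketch, assuming that the projected basis functions and the filtered image difference are available as sampled vectors (here called `features` and `target`, hypothetical names) and that the $L^2$ inner product is approximated by a plain dot product. The name `gram_least_squares` and the elimination routine are illustrative choices, not the authors' implementation.

```python
def gram_least_squares(features, target):
    """Solve M theta = B with M[i][j] = <f_i, f_j> and B[i] = <f_i, target>."""
    n = len(features)

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    # Augmented matrix [M | B] of the normal equations.
    aug = [[dot(features[i], features[j]) for j in range(n)]
           + [dot(features[i], target)] for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            # The Gram matrix is singular: the features are not independent.
            raise ValueError("features are not linearly independent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(n):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                aug[row] = [a - factor * b for a, b in zip(aug[row], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]
```

The explicit failure on a singular Gram matrix mirrors the requirement that the projected basis functions be mutually linearly independent at the chosen scale.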

3.2 Estimation Error and Residual Motion 

Given the least square estimation of the motion of last paragraph, we have 

Lemma 2. In this framework the motion estimation error is bounded by in- 
equality 

\\Ch-h*)\Vh\iU. < |a'^+2(rr(M-i)))"||7*|V7i|5j|2„ 

Proof. See [20] I 

If $Id + h$ is invertible, we can define:

$$I_{1,1} = I_1 \circ (Id + h)^{-1}. \qquad (9)$$

Letting $r_1$ denote the residual motion such that $I_{1,1} = I_2 \circ (Id + r_1)$, if $Id + h$
is ^-rigid for $I_1$ then a variable change yields equality

$$\|(h - h^*)\,|\nabla I_1|^{1/2}\|_{L^2} = \|r_1\,|\nabla I_{1,1}|^{1/2}\|_{L^2},$$

thus giving by Lemma 2 the following bound on the residual motion norm: 

||ri|V/i,i|5|U^ < |f^"+'(rr(M-i)))"||h*|V7i|^||i.. (10) 

In view of equality (9) and inequality (10), iterating the motion estimation/
registration process looks completely natural and allows for pointing out
sufficient conditions for convergence of such a process. Indeed, provided the same
assumptions are made at each step, relations (9) and (10) can be seen as
recurrence ones, yielding both the $r_p$ and $I_{1,p}$ sequences.

3.3 Theoretical Iterative Scheme 

Having control on the residual motion after one registration step, we deduce the 
following theoretical iterative motion estimation / registration scheme: 

1. Initialization: Enter accuracy $\epsilon > 0$ and the maximal number of iterations
N. Set p = 0, and $I_{1,0} = I_1$.

2. Iterate while ($\|I_{1,p} - I_2\| > \epsilon$ & p < N)

(a) Enter the set of basis functions $\Psi_p = (\psi_{p,i})_{i=1..n_p}$ that linearly and
uniquely decompose $r_p$ on the support of $|\nabla I_{1,p}|$.

(b) Enter scale $\sigma_p$ and compute the least square estimate $h_p = \langle \Psi_p, \theta_p \rangle$ of Section 3.1.

(c) Set $I_{1,p+1} = I_{1,p} \circ (Id + h_p)^{-1}$ and increment p.

3.4 Convergence Theorem 

Now that we have designed an iterative motion estimation / registration scheme, 
let us infer sufficient conditions for the residual motion to vanish. This leads us 
to state our following main result: 

Theorem 2. If: 

1. For all $p \ge 0$, $I_{1,p} \sim I_2$ (as defined in Section 2.1), and the residual motion $r_p$
can be linearly and uniquely decomposed on a set of basis functions $\{\psi_{p,i},\ i = 1..n_p\}$;

2. For all $p \ge 0$, there exists a scale $\sigma_p > 0$ such that the set of functions
$\{P^{\sigma_p}_{I_{1,p}}(\psi_{p,i}),\ i = 1..n_p\}$ be free in $L^2$ and, for p = 0, we assume that:

\\h*\Vh\iU^ < (|ao"+2Tr(Mo,.J^)^'; 

Set Co = (^^a^+^Tr{Mo,,„)i\\h*\Vh\i\\L 2 ^ 



-1 
3. The sequence of conditioning ratios satisfies the criterion:

vp > 0, < Co-, 



Tr{Mp 






4. For all $p \ge 0$, the estimated deformations $Id + h_p \in Hom(D)$ and are ^-rigid
for $I_{1,p}$;

Then, $\lim_{p \to \infty} \|r_p\,|\nabla I_{1,p}|^{1/2}\|_{L^2} = 0$.

Proof. See [20].



3.5 Numerical Algorithm Requirements 

Firstly, since h* is unknown, we have to make an arbitrary choice
for the scale at each step. Secondly, we at least have to ensure that Id + h be
invertible at each step. Finally, we are faced with the choice of the motion basis
functions.



Multi-scale Strategy The scale choice expresses both a priori knowledge on
the motion range and its structure complexity. Here we assume that $(\sigma_p)_p$ is an
increasing sequence, starting from $\sigma_0 > 0$ such that:

$$\# S_{\sigma_0} \ge \#\{\text{expected independent motions}\}. \qquad (11)$$

Then let $\alpha \in\, ]0, 1[$. In order to justify the minimization problem at new scale
$\sigma_{p+1} > \sigma_p$, we will choose it such that:

$$\|(\Pi_{\sigma_{p+1}} - \Pi_{\sigma_p})(I_{1,p+1} - I_2)\|_{L^2} \ge \alpha\,\|I_{1,p+1} - I_2\|_{L^2}. \qquad (12)$$



Invertibility of $Id + h_p$ Let $\beta > 0$. We will apply to $I_{1,p}$ the inverse of the
maximal invertible linear part of the computed deformation, i.e. $(Id + t^* h_p)^{-1}$,
where

$$t^* = \sup_{t \in [0,1]} \{t \;/\; \det(\mathrm{Jac}(Id + t\,h_p)) \ge \beta\}. \qquad (13)$$
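The step $t^*$ can be approximated by scanning candidate values of t from 1 downwards. The sketch below is only an illustration under several assumptions: the dimension is d = 2, the Jacobian matrices of the computed motion field are supplied by the caller at a finite set of sample points, the determinant test uses "greater than or equal to", and the supremum is searched on a regular grid of `steps` values. The function name `max_invertible_step` is hypothetical.

```python
def max_invertible_step(jacobians, beta, steps=100):
    """Largest t in {0, 1/steps, ..., 1} with det(I + t * J) >= beta for every J.

    `jacobians` is a list of 2x2 Jacobians of the motion field, each given
    as ((a, b), (c, d)).
    """
    def determinant_ok(t):
        for (a, b), (c, d) in jacobians:
            det = (1.0 + t * a) * (1.0 + t * d) - (t * b) * (t * c)
            if det < beta:
                return False
        return True

    # Scan from the full step downwards and keep the first admissible value.
    for i in range(steps, -1, -1):
        t = i / steps
        if determinant_ok(t):
            return t
    return None  # only possible when beta > 1, since det = 1 at t = 0
```

Scanning from the top returns the largest admissible grid value even when the admissible set is not an interval, which can happen in two dimensions because the determinant is quadratic in t.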



Remark. 

Recursive version of the algorithm 

Set $f^*(I_1, I_2)$ the solution to the correspondence problem between $I_1$ and
$I_2$. Then, $f^*(I_{1,p}, I_2) = f^*(I_{1,p+1}, I_2) \circ (Id + h_p)$. We thus deduce the following
alternate recursive motion estimation / registration function $f^*(I_1, I_2)$ defined
by:

If $\|I_1 - I_2\| > \epsilon$,
    Calculate $h(I_1, I_2)$
    Deform: $I_{1,1} = I_1 \circ (Id + h(I_1, I_2))^{-1}$
    Call $f = f^*(I_{1,1}, I_2)$
    Return $f \circ (Id + h(I_1, I_2))$
Else return Id
Choosing the Set of Basis Functions A major difficulty arising in the
theoretical scheme comes from the lack of a priori knowledge on the finite set of basis
functions to be entered at each step. To alleviate this problem we propose two
different approaches. In [20] we consider splitting both images into a collection
of pairs of level sets to be matched. In Section 4 we will use an implicit approach
via the optimal step gradient algorithm when minimizing the quadratic energy
associated to MCE (6).

4 Implicit Approach of Basis Functions 

We now use the optimal step gradient algorithm for the minimization of the 
quadratic functional associated to MCE (6). There are at least two good reasons 
for doing this: 

— the choice of base functions is implicit: it depends on the images $I_1$ and $I_2$,
and the scale space.

— we can control and stop the quadratic minimization if the associated operator
is no longer positive definite.

The general algorithm does not guarantee that the resulting matrix be
invertible. Hence we suggest systematically using a stopping criterion to control
the quadratic minimization, based on the descent speed or simply a maximum
number of iterations $N_0$.

In that case our final algorithm writes: 

1. Initialization: Enter accuracy $\epsilon > 0$ and the maximal number of iterations
N. Set p = 0, $I_{1,0} = I_1$, and choose first scale $\sigma_0$ according to (11).

2. Iterate while ($\|I_{1,p} - I_2\| > \epsilon$ & p < N & $\sigma_p < 1$)

(a) Choose $\sigma_p$ satisfying (12).

(b) Apply $N_0$ iterations of the optimal step gradient algorithm for the
minimization of

$$E_p(h) = \|\Pi_{\sigma_p}(I_{1,p} - I_2) - P^{\sigma_p}_{I_{1,p}}(h)\|^2_{L^2}.$$

(c) Compute $I_{1,p+1} = I_{1,p} \circ (Id + t^* h_p)^{-1}$ with $t^*$ defined by (13) and
increment p.

In the following experiments we have fixed parameters to $\alpha = 2.5\%$, $N_0 = 5$,
$\beta = 0.1$.
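Step (b) relies on the optimal step gradient algorithm, i.e. steepest descent with an exact line search. The following is a minimal sketch for a generic quadratic energy $\|Ah - y\|^2$, assuming the linear operator is available as a small dense matrix `A` (a hypothetical stand-in for the projected motion operator) and `y` stands for the filtered image difference. The early exit when the denominator vanishes is an assumed way of stopping once the operator is no longer positive definite along the descent direction.

```python
def optimal_step_gradient(A, y, n_iter=5):
    """Run n_iter steepest-descent steps with exact line search on ||A h - y||^2."""
    rows, cols = len(A), len(A[0])
    h = [0.0] * cols
    for _ in range(n_iter):
        residual = [sum(A[i][j] * h[j] for j in range(cols)) - y[i]
                    for i in range(rows)]
        # Half of the energy gradient: A^T (A h - y).
        grad = [sum(A[i][j] * residual[i] for i in range(rows))
                for j in range(cols)]
        A_grad = [sum(A[i][j] * grad[j] for j in range(cols))
                  for i in range(rows)]
        denom = sum(v * v for v in A_grad)
        if denom <= 1e-15:
            break  # no curvature left along the descent direction
        # Exact minimiser of the energy along -grad.
        step = sum(g * g for g in grad) / denom
        h = [hj - step * gj for hj, gj in zip(h, grad)]
    return h
```

Limiting the loop to a small `n_iter` corresponds to the maximum number of iterations $N_0$ used as a stopping criterion above.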



Running the Algorithm 

We illustrate the algorithm on pairs of images with large deformation for regis- 
tration applications and movies for motion estimation applications. 

— Registration problems involving large deformation: In Figures 4 and
5 we show the different steps of the algorithm performing the registration
between the first and last images. In Figures 6 to 8, we show the study and
target images, and the deformed study image after applying the estimated
motion. This was applied for two examples of faces and a turbulence image
featuring a vortex at two different states.

— Optical flow estimation examples: In Figure 9 we show the sequence
of the registered images of the original Cronkite sequence onto the first image
using the sequence of computed backward motions. The result is expected to
be motionless. On top of Figure 10, we show the complete movie obtained by
iteratively deforming only the first image of the Cronkite movie. For that we use
the sequence of computed motions between each pair of consecutive images
of the original movie. At the bottom of Figure 10, we see the error images.

Fig. 4. Registration movie of a rotated rectangle: from left to right and from top to
bottom we show the different steps of the algorithm performing the registration.

Fig. 5. Registration movie of a target to a 'C' letter. Again, each image corresponds
to a step in the iterative scheme.

5 Conclusion 

We have addressed the theoretical problems of motion estimation and
registration of images. We have introduced a new local rigidity hypothesis that we used
to infer a unique Motion Constraint Equation with small remainder at coarse
scales. We then showed that, under hypotheses on the motion norm and the
structure/scale tradeoff, an iterative motion estimation/registration scheme could
converge towards the expected solution of the problem, i.e. the global minimum
of the nonlinear least square problem energy. Since each step of the theoretical
scheme needs a set of motion basis functions which are not known, we have
designed an implicit algorithm and illustrated the method with synthetic and real
images, including large deformation examples.

Fig. 6. Scene registration example: Study image (left), deformed Study image onto
Target image (center), and Target image (right).

Fig. 7. Registration of a face with two different expressions: Study image (left),
deformed Study image onto Target image (center), and Target image (right).
Abstract. Spherical object reconstruction is of great importance,
especially in the field of cell biology, since cells as well as cell nuclei mostly
have the shape of a deformed sphere. A fast, reliable and precise
procedure is needed for automatic measurement of topographical parameters of
a large number of cells or cell nuclei. This paper presents a new method
for spherical object reconstruction. The method springs from Delingette's
general object reconstruction algorithm, which is based on the
deformation of simplex meshes. However, the unknown surface is searched only
within the subclass of simplex meshes which have the shape of a star.
Star-shaped simplex meshes are suitable for modelling spherical or
ellipsoidal objects. In our approach, the law of motion was altered so that
it preserves the star shape during deformation. The proposed method is
simpler than the general method and therefore faster. In addition, it uses
more computationally stable expressions than a method strictly
implemented according to Delingette's paper. It is also shown how to partly
avoid the occasional instability of the Delingette method. The accuracy
of both methods is comparable. The star-shaped method achieves a
stable state more often.



1 Introduction 

Image analysis has recently become an essential part of many scientific disci- 
plines. One of them is cytometry which performs measurements on cells and 
their components (cell nuclei, chromosomes, etc.). The shape of cell nuclei is 
mostly similar to a sphere or, more generally, to an ellipsoid. There is a need for
a general method for the reconstruction of objects of this type.

Many techniques for reconstruction of cell nuclei are based on thresholding 
[12,3,9]. The resultant boundary of an object is often represented as a set of 
pixels (or a set of voxels in three-dimensional studies). Isosurfacing methods 
based on the marching cubes algorithm [11] are also used [10,7]. However, these
methods do not handle missing and noisy data and can hardly be used for
images with clusters of nuclei (e.g. tissue), because they make no assumption
about the shape to recover.

Deformable modelling is capable of working with a priori knowledge about
the shape of the object and therefore it can deal with missing and noisy data. The
3D extension of the Kass active contour model [8] has been applied to segmentation
of CLSM tissue images [1]. However, the definition of the minimised energy does not
consider the spherical nature of nuclei. A priori knowledge about roundness is
involved only in the shape of the initial model. Therefore the model can easily
stick to artifacts or other objects in the neighbourhood.

The active contour method used in [2] is based on global minimisation. The 
search space is bounded by two concentric circles centralised upon a point found 
by an initial rough segmentation. The knowledge about the roundness is ex- 
ploited during the construction of the search space and therefore it influences 
the energy minimisation. Unfortunately, this highly successful method for the au- 
tomatic segmentation of cervical cell nuclei is designed only for two-dimensional 
space. 

In this paper an original method for 2D and 3D reconstruction of cell nuclei 
boundaries (i.e circular or spherical objects) is described. However, this algo- 
rithm can be also used for reconstruction of some other “star-shaped” objects. 
The method is built on the Delingette general object reconstruction algorithm 
[5]. Delingette has used simplex meshes [4] for representation of surface. We 
discovered a new subclass of simplex meshes, which we call star-shaped sim- 
plex meshes. This subclass has some nice features and it is very convenient for 
spherical object reconstruction. 

Simplex meshes and star-shaped simplex meshes are defined in Section 2.1. 
Section 2.2 introduces important notations and some geometric properties of 
simplex meshes. In Section 2.3 law of motion for star-shaped simplex meshes 
is presented, since it differs from the law of motion used for general simplex 
meshes. In Section 3 the biomedical application for which our method could be 
used is outlined. 

2 Method 

2.1 Definition of Simplex Meshes and Star-Shaped Simplex Meshes 

Delingette's definition of a k-simplex mesh embedded in a Euclidean space $\mathbb{R}^d$
is cited [4,5] at the beginning of this section and then the definition of the star-shaped
subclass follows.

A 0-cell of $\mathbb{R}^d$, also called vertex, is a point of $\mathbb{R}^d$. A 1-cell of $\mathbb{R}^d$, also called edge,
is an unordered pair of distinct vertices of $\mathbb{R}^d$. A p-cell ($p \ge 2$) C is recursively
defined as a set of (p − 1)-cells such that:

1. Every vertex belonging to C belongs to p distinct (p − 1)-cells.

2. The intersection of two (p − 1)-cells is either empty or is a (p − 2)-cell.

A 2-cell is therefore a closed polygonal line of $\mathbb{R}^d$ and is called face. Examples
of p-cells are shown in Fig. 1.

A k-simplex mesh is simply defined as a (k + 1)-cell of $\mathbb{R}^d$.

We denote the convex hull of p-cell C by CH(C). Let O be an arbitrary point
of $\mathbb{R}^d$. The cone of apex O with respect to p-cell C of $\mathbb{R}^d$ is a union of all rays
which begin at O and intersect CH(C). This cone is denoted by $\mathcal{K}(O, C)$.
Fig. 1. Examples of some p-cells, from left to right are 0-cell (vertex), 1-cell (edge),
2-cell (face) and 3-cell.

Fig. 2. Example of star-shaped 1-simplex mesh of $\mathbb{R}^2$ (left) and star-shaped 2-simplex
mesh of $\mathbb{R}^3$ (right). There exists a point O inside the mesh such that any ray from O
through any vertex P intersects the mesh only in point P. See text for exact definition.

Let the k-simplex mesh $\mathcal{M}$ consist of n k-cells $C_i$, $i \in \{1, \ldots, n\}$.

The k-simplex mesh $\mathcal{M}$ is star-shaped if and only if there exists a point O in
$\mathbb{R}^d$ such that

1. if two k-cells $C_i$ and $C_j$ intersect at (k − 1)-cell C then the intersection of
cones $\mathcal{K}(O, C_i)$ and $\mathcal{K}(O, C_j)$ is exactly the cone $\mathcal{K}(O, C)$.

2. if two k-cells $C_i$ and $C_j$ do not intersect then the intersection of cones $\mathcal{K}(O, C_i)$
and $\mathcal{K}(O, C_j)$ is exactly the point O.

The point O is called centre.

Examples of star-shaped simplex meshes are in Fig. 2.

Since the class of star-shaped k-simplex meshes is a subclass of k-simplex
meshes, they have the same properties as the general k-simplex meshes [4].
An important property of a k-simplex mesh is that each vertex has exactly k + 1
neighbouring vertices. Other properties are recalled in Section 2.2.

The short term simplex mesh will be used instead of 2-simplex mesh of $\mathbb{R}^3$
in the rest of this article.



2.2 Geometry of Simplex Meshes 

Basic notation and important geometric quantities of simplex meshes (2-simplex
meshes of $\mathbb{R}^3$) are introduced in this section. Similar results hold for 1-simplex
meshes of $\mathbb{R}^2$ [5]. In our method the simplex angle is the most important quantity.
It is used to control the desired shape of the simplex mesh.

Fig. 3. (left) The circumscribed sphere $K_i$ of radius $R_i$ and of centre $O_i$. The
circumscribed circle $S_i$ of radius $r_i$ and of centre $C_i$. Vector $n_i$ is the unit normal vector of
the plane defined by three neighbours of vertex $P_i$. (right) Projection of left figure. The
geometrical meaning of the simplex angle $\varphi_i$ is illustrated.



A vertex $P_i$ of a simplex mesh has three neighbouring vertices $P_{N_1(i)}$, $P_{N_2(i)}$
and $P_{N_3(i)}$. These three points define a plane and the normal vector of this plane
is denoted $n_i$. The normal vector is oriented towards the outer side of the surface.
We introduce the circumscribed circle $S_i$ to the triangle $P_{N_1(i)} P_{N_2(i)} P_{N_3(i)}$. This
circle is of centre $C_i$ and of radius $r_i$. We also introduce the sphere $K_i$ of centre
$O_i$ and radius $R_i$, which is circumscribed to the four vertices $P_i$, $P_{N_1(i)}$, $P_{N_2(i)}$,
$P_{N_3(i)}$ (see Fig. 3).

The simplex angle $\varphi_i = \angle(P_i, P_{N_1(i)}, P_{N_2(i)}, P_{N_3(i)})$ at $P_i$ is defined by the
following equations:

$$\varphi_i \in [-\pi, \pi]$$

$$\sin(\varphi_i) = \frac{r_i}{R_i}\,\mathrm{sign}((P_{N_1(i)} - P_i) \cdot n_i) \qquad (1)$$

$$\cos(\varphi_i) = \frac{\|O_i - C_i\|}{R_i}\,\mathrm{sign}((C_i - O_i) \cdot (P_i - C_i))$$

The simplex angle $\varphi_i$ is independent of the position of vertices $P_{N_1(i)}$, $P_{N_2(i)}$,
$P_{N_3(i)}$ on the circle $S_i$ and of $P_i$ on a hemisphere of $K_i$. It is zero when $P_i$ is
on the plane defined by its three neighbours. The value of the simplex angle
is invariant to rotation, translation and scale transformation. The geometric
meaning of the simplex angle is pictured in Fig. 3 (right).

There is a simple relationship between the simplex angle and the curvature
$|H_i| = 1/R_i$, also called the mean curvature at vertex $P_i$ [4]:

$$H_i = \frac{\sin(\varphi_i)}{r_i} \qquad (2)$$
2.3 Deformation of Star-Shaped Simplex Meshes

Law of Motion. We have used the same law of motion that was used for general
simplex meshes by Delingette [5]. Vertices of a simplex mesh are considered as
physical masses subject to a Newtonian law of motion including internal and
external forces:

$$m \frac{d^2 P_i}{dt^2} = -\gamma \frac{dP_i}{dt} + F_{int} + F_{ext} \qquad (3)$$

where m is the vertex mass and $\gamma$ is the damping factor. $F_{int}$ is the internal
force constraining the shape of the mesh. $F_{ext}$ is the external force constraining
the distance between the mesh and the three-dimensional dataset. The term $-\gamma \frac{dP_i}{dt}$
represents the counteraction of the environment in which the mass is embedded.

The evolution of the simplex meshes under the law of motion (3) can be
computed in the following manner [5]. Time t is discretized and, using central
finite differences with an explicit scheme, the law of motion has the form:

$$P_i^{t+1} = P_i^t + (1 - \gamma)(P_i^t - P_i^{t-1}) + \alpha_i F_{int} + \beta_i F_{ext} \qquad (4)$$

Both forces $F_{int}$ and $F_{ext}$ are computed at time t. This explicit scheme is
conditionally stable and therefore parameters $\alpha_i$ and $\beta_i$ must belong to a given
interval to guarantee a stable scheme. In Eq. 4 the forces have the dimension of
a displacement.
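The explicit update of Eq. 4 can be written for a single vertex as follows. This is a minimal sketch: the function name `step_vertex` and the representation of points and forces as coordinate tuples are assumptions made for the example, and the forces are assumed to be already evaluated at time t.

```python
def step_vertex(p_now, p_prev, f_int, f_ext, alpha, beta, gamma):
    """One explicit time step: P(t+1) from P(t), P(t-1) and the two forces."""
    return tuple(
        x + (1.0 - gamma) * (x - x_prev) + alpha * fi + beta * fe
        for x, x_prev, fi, fe in zip(p_now, p_prev, f_int, f_ext)
    )
```

With `gamma = 1` the inertia term vanishes and the vertex moves only by the weighted forces; smaller values of `gamma` carry over part of the previous displacement.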

The motion of the star-shaped simplex mesh is confined in the following 
manner. We choose any proper centre O of the star-shaped simplex mesh and 
mark it as the centre of deformation. It means that the only allowed motion of 
vertices is along the rays from this centre. These rays are called deformational 
rays. 

Deformation along the deformational rays preserves the star-shaped quality.
The forces defined in [5] could be used with only one change to achieve
this type of deformation: instead of the forces one can take their projections onto
the deformational rays. However, we will redefine these forces in order to make
computations on the star-shaped simplex meshes faster and more stable.



Internal Force Computation. Internal force expressions for simplex meshes 
are not derived from minimisation of a global functional. The geometric param- 
eters at a vertex are related to the vertex position with a complex non-linear 
relationship. Therefore, minimisation of a global functional expressed in terms of 
the geometric parameters would lead to unnecessarily complex force expressions 
[5]. Delingette proposed simplified regularising force formulae at the expense of 
not having a global functional for guiding the minimisation. 

The internal force of deformable simplex meshes is defined as the composition
of a tangential force and a normal force. The goal of the tangential force is to
control the vertex position with respect to its three neighbours in the tangent
plane¹. The tangential force is not needed for deformation of star-shaped simplex
meshes. The deformational rays play the role of this force.

The normal force acts in order to change the mean curvature of the surface
according to the required geometrical continuity or the required local shape.
These requirements are expressed by means of the reference simplex angle $\tilde{\varphi}_i$.
Remember that there is a simple relation between the simplex angle and the mean
curvature (Eq. 2). The reference simplex angle is determined for each vertex
before each iteration or at the beginning of the minimisation process. Possibilities
of computing $\tilde{\varphi}_i$ are discussed later.

Delingette used the following formula for computation of the normal force:

$$F_{norm} = (L(r_i, d_i, \tilde{\varphi}_i) - L(r_i, d_i, \varphi_i))\, n_i \qquad (5)$$

where

$$L(r_i, d_i, \varphi_i) = \frac{(r_i^2 - d_i^2)\tan(\varphi_i)}{r_i + \delta\sqrt{r_i^2 + (r_i^2 - d_i^2)\tan^2(\varphi_i)}} \qquad (6)$$

The not yet defined symbol $d_i$ is equal to the distance between the centre $C_i$ and the
orthogonal projection of $P_i$ onto the neighbouring triangle ($P_{N_1(i)}$, $P_{N_2(i)}$, $P_{N_3(i)}$).
Factor $\delta = 1$ if $|\varphi_i| < \pi/2$ and $\delta = -1$ if $|\varphi_i| > \pi/2$. If equation 6 is used for
minimisation, the computation is unstable for $\varphi_i$ near 0 or $\pi/2$. Since $L(r_i, d_i, \varphi_i)$
computes the distance between $P_i$ and the neighbouring triangle ($P_{N_1(i)}$, $P_{N_2(i)}$, $P_{N_3(i)}$),
equation 5 can be rewritten in a form more suitable for stable computation:

$$F_{norm} = (L(r_i, d_i, \tilde{\varphi}_i) - (P_{N_1(i)} - P_i) \cdot n_i)\, n_i \qquad (7)$$

Using this equation, the stability of the normal force computation depends only on the
value of the reference simplex angle $\tilde{\varphi}_i$.

The internal force of star-shaped simplex meshes does not act in the normal
direction but in the direction of a deformational ray and is defined by

$$F_{int} = \tilde{P}_i - P_i \qquad (8)$$

where $P_i$ is the vertex on which the force is acting and $\tilde{P}_i$ is the point which lies
on the deformational ray and whose simplex angle with respect to the neighbours
of $P_i$ is equal to the reference simplex angle (see Fig. 4).

The point $\tilde{P}_i$ can be computed as an intersection of the deformational ray
through $P_i$ and the part of the sphere on which all points have their simplex angles
with respect to the neighbours of $P_i$ equal to the reference simplex angle. The
intersection is given by

$$\tilde{P}_i = O + v_i t \qquad (9)$$

where O is the centre of deformation, $v_i = \frac{P_i - O}{\|P_i - O\|}$ is the unit directional vector
of the deformational ray through vertex $P_i$ and t is equal to

$$t = v_i \cdot (\tilde{O}_i - O) + \delta\sqrt{\tilde{R}_i^2 + (v_i \cdot (O - \tilde{O}_i))^2 - \|O - \tilde{O}_i\|^2} \qquad (10)$$

where $\delta = 1$ if $\tilde{\varphi}_i < 0$ and $\delta = -1$ otherwise, $\tilde{R}_i = \left|\frac{r_i}{\sin\tilde{\varphi}_i}\right|$ and
$\tilde{O}_i = C_i + n_i \tilde{R}_i \cos\tilde{\varphi}_i$.

Fig. 4. Computation of internal force $F_{int}$. The force is acting along the ray $O + v_i t$
in order to move the vertex $P_i$ with simplex angle $\varphi_i$ to the new position $\tilde{P}_i$ where
the simplex angle is equal to the reference simplex angle $\tilde{\varphi}_i$.

¹ The plane which passes through $P_i$ and is parallel to the plane defined by the three
neighbours of $P_i$.
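The geometric core of this step is the intersection of a ray with a sphere. The sketch below solves the corresponding quadratic equation for a generic ray and sphere; it is an illustration under stated assumptions rather than the authors' code. The function name `ray_sphere_point` and the boolean `far`, which selects between the two roots and plays the role of the sign factor in Eq. 10, are hypothetical.

```python
import math


def ray_sphere_point(origin, direction, center, radius, far=True):
    """Point origin + t * v where the ray meets the sphere, or None if it misses."""
    norm = math.sqrt(sum(d * d for d in direction))
    v = [d / norm for d in direction]                 # unit direction of the ray
    oc = [c - o for o, c in zip(origin, center)]      # sphere centre minus ray origin
    proj = sum(a * b for a, b in zip(v, oc))          # projection of oc onto the ray
    disc = radius ** 2 + proj ** 2 - sum(a * a for a in oc)
    if disc < 0.0:
        return None                                   # the line does not touch the sphere
    root = math.sqrt(disc)
    t = proj + root if far else proj - root
    if t < 0.0:
        return None                                   # intersection lies behind the origin
    return [o + t * vi for o, vi in zip(origin, v)]
```

For a ray starting inside the sphere only the far root is non-negative, so there is a single admissible intersection along the ray in that case.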

The reference simplex angle $\tilde{\varphi}_i$ can be computed in the following manners:

— $C^0$ constraint. $\tilde{\varphi}_i$ is set to $\varphi_i$ and therefore no internal force is acting. The
surface can freely bend around vertex $P_i$.

— $C^1$ constraint. $\tilde{\varphi}_i$ is set to 0 and accordingly each vertex moves towards its
centre of curvature in the case of the Delingette method or towards the centre
of deformation in the case of star-shaped meshes.

— shape constraint. $\tilde{\varphi}_i$ is set to a constant value $\varphi_i^0$. This case is important for
spherical shape reconstruction. The constants are calculated according to the
initial surface (sphere or ellipsoid).

— simplex shape continuity or $C^2$ constraint. $\tilde{\varphi}_i$ is set to an average value of
the simplex angles at neighbouring vertices.



External Force Computation. Problem of external force computation is 
broadly and deeply discussed in [5] . The expression of the external force depends 
on the nature of the input data. When star-shaped simplex meshes are being 
used, it is important to let the external force (as well as the other forces) act 
in the direction of deformational rays. The method which we used for external 
force computation is mentioned in the next section. 
Fig. 5. (Left) An xy slice of the HL-60 input dataset. The volume of the nucleus is stained.
(Right) An xy slice of the HT-29 input dataset. The nuclear envelope of the nucleus is
stained. Some stain was washed away from the envelope and only part of it is visible.



3 Application to Cell Nuclei Segmentation 

Segmentation and reconstruction of the boundaries of cell nuclei is essential in 
many biological studies. Reliable 3-D model of nucleus is for example necessary 
for the measurement of some topographical parameters (e.g. distance between 
nucleus boundary and genes.). Measured topographical parameters are statisti- 
cally evaluated and therefore a large number of nuclei must be processed in an 
efficient and reliable manner. Visualisation of the reconstructed model of nucleus 
boundary is also important for understanding its shape. 

We have performed nucleus reconstruction on two types of input images. 

HL-60 - DAPI-stained chromatin of 3-D fixed HL-60 cells, i.e. the whole volume
of the nuclei is visualised. A typical xy slice of this input dataset is in Fig. 5 (left).
Voxels with the highest gradient intensity determine the nucleus boundary.
HT-29 - nuclear envelope of 3-D fixed nuclei obtained from a stabilised cell line
of human colon adenocarcinoma HT-29. The nuclear envelope was visualised
using Lamin B. A typical xy slice of the input dataset is in Fig. 5 (right).
The voxels with the highest intensity values correspond to the nuclear envelope.
Because the nuclear envelope is eroded during the process of specimen
preparation, only parts of the nuclear envelope are clearly visible. The reconstruction
algorithm has to deal with missing data. The voxels with lower
intensities correspond to the places from which the Lamin B was not properly
washed away.

Two methods which were described in the previous section were applied to the
mentioned data:

SS - Star-shaped simplex meshes (Section 2).

GM - General method proposed by Delingette [5] with the change in normal
force computation (stable formula 7 was used).

The shape constraint was used for determination of the reference simplex angle in
both methods. The reference simplex angles of vertices were set to the values
computed from the initial surface. Hence, the internal force was keeping the
initial spherical (ellipsoidal) shape.

Table 1. Comparison of SS and GM methods (HL-60 input dataset). Testing was
accomplished for fifteen different combinations of parameters α, β and γ for each method.
For each pair (α, β), results for the SS method are on the first line and results for the
GM method are on the second line. n is the number of iterations to achieve a stable state,
t is the time of deformation and μ is the mean distance between the input points and the
final mesh. Symbol ∞ means that the computation was oscillating.

                   Damping factor γ
α, β      Method   0.1                 0.25                0.5                 0.75                1
                   n    t       μ      n    t       μ      n    t       μ      n    t       μ      n    t       μ
0.9, 0.1  SS       66   12.54   4.68   63   11.97   4.67   69   13.11   4.8    32   6.08    5.1    25   4.75    6.23
          GM       62   14.88   4.8    60   14.4    4.82   87   20.88   4.85   24   5.76    5.55   25   6.0     6.18
0.8, 0.2  SS       68   12.92   4.11   43   8.17    4.15   56   10.64   4.22   74   14.06   4.26   83   15.77   4.3
          GM       502  120.48  4.2    72   17.28   4.24   45   10.8    4.26   62   14.88   4.29   68   16.32   4.32
0.7, 0.3  SS       75   14.25   3.79   44   8.36    3.88   58   11.02   3.88   72   13.68   3.89   94   17.86   3.9
          GM       ∞    ∞       ?      ∞    ∞       ?      102  24.48   3.89   118  28.32   3.9    76   18.24   3.95

Input datasets were preprocessed before the deformation process. Gaussian
smoothing and intensity normalisation [13] were used. Input images were filtered
by LoG (Laplacian of Gaussian) instead of Gaussian in the experiment with HL-60
cells. An mKD-tree [14] was constructed for the N voxels with the highest intensities
(hundreds on each slice were enough in our application); the m closest points to a given
point can be obtained efficiently using mKD-trees. An ellipsoid was taken as the
initial surface. The ellipsoid generation was based on ellipse fitting [6] on each
slice.

An external force was computed as a projection of the following force
F onto the deformational ray through the given vertex $P_i$.

m 

f = ( 11 ) 

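where $X_j$ are the m closest points to the given vertex and $w_j$ is defined by

$$w_j = \begin{cases} 0, & \|X_j - P_i\| > \text{const} \\ \dfrac{I(X_j)}{\|X_j - P_i\|^2}, & \text{otherwise} \end{cases}$$

where $I(X_j)$ is the intensity of point $X_j$.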

Deformation was computed until the change of the mesh was insignificant. Each
method was run on each dataset fifteen times, always with different
parameters $\alpha$, $\beta$ and $\gamma$. The damping factor $\gamma$ iterated over the values 0.1, 0.25, 0.5,
0.75 and 1 and the couple ($\alpha$, $\beta$) over (0.9, 0.1), (0.8, 0.2) and (0.7, 0.3). Three
values were logged for every execution of each method:

— the number of iterations n needed to achieve a stable state,

— the time of deformation t and

— the mean distance $\mu$ between the input points and the final mesh.

Times were measured on a Pentium III 500 MHz processor with 128 MB RAM running
Debian Linux. Results are summarized in Tables 1 and 2.

Table 2. Comparison of SS and GM methods (HT-29 input dataset). This table is
organized in the same manner as Table 1.

Fig. 6. Final meshes in HL-60 experiment, $\gamma = 0.5$, $\alpha = 0.9$, $\beta = 0.1$. (Left) SS method.
(Right) GM method.

Tables show that accuracy expressed by means of the mean distance between 
model and data is very similar for both methods. However, on the other side 
it is impossible to decide exactly what result better approximates the border of 
real cell nucleus. An expert in biology have to make a selection what method 
and what parameters are most appropriate for a given application. 

Many experiments with the GM method did not converge. We investigated 
this problem of oscillations and found that all oscillations were due 
to tangential forces. We therefore believe that the star-shaped 
method never oscillated because it has no tangential forces. However, 
we are not able to prove this statement at the moment. 

The sum of α and β is intentionally equal to one. Changing the ratio of these 
two parameters influences the minimisation process, since the ratio expresses the 
balance between the internal and external forces. By adjusting this ratio the user 
can change the shape of the spherical object and drive the model either closer to 
the input data or closer to the initial spherical model. Adjusting the α, β ratio 
can be used for segmentation of noisy or clean data. The general method is less 
suitable for α, β ratio adjustment than the star-shaped method because it 
oscillates for some α and β. 
Fig. 7. Results in xy slices, HL-60 experiment, γ = 0.5, α = 0.9, β = 0.1, SS method. 
The final surface was projected onto the input data. 

Fig. 8. Results in xz slices, HL-60 experiment, γ = 0.5, α = 0.9, β = 0.1, SS method. 
The final surface was projected onto the input data. 




Fig. 9. Final meshes in HT-29 experiment, γ = 0.5, α = 0.9, β = 0.1. (Left) SS method. 
(Right) GM method. 



Fig. 6 shows some final meshes of both methods for the HL-60 dataset. The 
simplex meshes had 2700 vertices. Figs. 7 and 8 depict the projection of the final 
mesh obtained by the SS method onto some xy and xz slices of the 
input image. 
Fig. 10. Results in xy slices, HT-29 experiment, γ = 0.75, α = 0.9, β = 0.1, SS method. 
The final surface was projected onto the input data. 

Fig. 11. Results in yz slices, HT-29 experiment, γ = 0.75, α = 0.9, β = 0.1, SS method. 
The final surface was projected onto the input data. 



Fig. 9 shows some final meshes of both methods for the HT-29 dataset. Figs. 10 
and 11 depict the projection of the final mesh obtained by the SS method 
onto some xy and yz slices of the input HT-29 image. 

4 Conclusion 

A new method for spherical object reconstruction based on deformation of 
star-shaped simplex meshes was proposed. The new method provides results similar 
to those of Delingette's general method [5], from which we drew inspiration. However, 
the new method uses a simpler model and is therefore faster. In addition, it 
reached a stable state more often than the general method. We will study the method 
further, because some questions remain unanswered. The most important one is 
for which values of the coefficients α and β the minimisation is stable. The answer to 
this question is important for further progress of the new method. 
Abstract. Gabor feature space is elaborated for representation, processing 
and segmentation of textured images. As a first step of preprocessing 
of images represented in this space, we introduce an algorithm for Gabor 
feature space denoising. It is a geometric-based algorithm that applies a 
diffusion-like equation derived from a minimal weighted area functional, 
introduced previously and applied in the context of stereo reconstruction 
models [6,12]. In a previous publication we have already demonstrated 
how to generalize the intensity-based geodesic active contours model to 
the Gabor spatial-feature space. This space is represented, via the 
Beltrami framework, as a 2D Riemannian manifold embedded in a 6D space. 
In this study we apply the minimal weighted area method to smooth 
the Gabor space features prior to the application of the geodesic active 
contour mechanism. We show that this "Weighted Beltrami" approach 
preserves edges better than the original Beltrami diffusion. Experimental 
results of this feature space denoising process and of the geodesic active 
contour mechanism applied to the denoised feature space are presented. 

Keywords: Gabor analysis, geometric-based algorithms, geodesic active 
contours, Beltrami framework, anisotropic diffusion, image manifolds, 
minimal weighted area method. 



1 Introduction 

Textured image segmentation is an important issue in image analysis. However, 
real world textures are difficult to model. Among the approaches to the analysis 
of textures are local geometric primitives [9], local statistical features [3], random 
field models [8,4] and the FRAME theory [23], which combines filtering theory 
and Markov random field modeling through the maximum entropy principle. 
Another approach, based on the human visual system, has emerged, in which 
texture features are extracted using Gabor filters [19]. 

The motivation for the use of Gabor filters in texture analysis is twofold. 
First, it is believed that simple cells in the visual cortex can be modeled by Gabor 
functions [16,5], and that the Gabor scheme provides a suitable representation 
for visual information in the combined frequency-position space [18]. Second, the 
Gabor representation has been shown to be optimal in the sense of minimizing 
the joint two-dimensional uncertainty in the combined spatial-frequency space 
[7]. The analysis of Gabor filters was generalized to multi-window Gabor filters 
[24] and to Gabor-Morlet wavelets [18,24,17,15], and studied both analytically 
and experimentally on various classes of images [24]. 

A great deal of attention has been devoted in recent years to "snakes", or 
active contour models, which were proposed by Kass et al. [10] for intensity-based 
image segmentation. In this framework an initial contour is deformed towards the 
boundary of an object to be detected. The evolution equation is derived from 
minimization of an energy functional, which attains a minimum for a curve 
located at the boundary of the object. A major drawback of the classical snakes 
algorithm is its dependence on the parameterization of the curve: different 
choices of parameterization may lead to different results. 

The geodesic active contours model [2,11] offers a different perspective for 
solving the boundary detection problem: it is based on the observation that 
the energy minimization problem is equivalent to finding a geodesic curve in a 
Riemannian space whose metric is derived from image contents. The geodesic 
curve can be found via a parameterization invariant geometric flow. Utilization 
of the Osher and Sethian level set numerical algorithm [20] allows automatic 
handling of changes of topology. 

It was shown recently that the Gaborian spatial-feature space can be 
described, via the Beltrami framework [22], as a 4D Riemannian manifold [13] 
embedded in ℝ⁶. Based on this approach, the intensity-based geodesic active 
contours method was generalized to the Gabor feature space of images [21]. It 
was shown that the geodesic snakes mechanism can be used for texture 
segmentation when applied to the Gabor spatial feature space of images rather than 
the intensity images themselves. The metric introduced in the Gabor space was 
used to derive the inverse edge indicator function E, which in turn attracts the 
evolving curve towards the boundary in the geodesic snakes schemes. Once the 
Gabor feature space of an image is derived, the scale and orientation for which 
the maximum amplitude of the transform was obtained are kept for each pixel. 
Thus, for each pixel, the maximum value of the Gabor transform coefficient and 
the orientation and scale that yield this maximum value are obtained. This 
approach results in a 2D manifold embedded in a 6D space. It was shown that 
using this approach the geodesic snakes yield good results when the textures are 
homogeneous and can be characterized by these maximum values. 

However, the maximum values provide only partial information regarding 
image structure in the full Gabor feature space. This may, in turn, generate 
less than satisfactory results in the case of more complex textures. One solution 
to this problem is to apply the geodesic snakes mechanism to the complete 
Gabor feature space and interpret the Gabor transform of an image as a function 
assigning a value to each pixel's coordinates, scale and orientation. Thus, the 
Gabor transform of an image may be viewed as a 4D manifold embedded in ℝ⁶. 
An alternative solution is to improve the results obtained from the approach of a 
2D manifold embedded in a 6D space, which is what we aim to achieve here. 

We apply the weighted area minimization method to improve the results for 
the orientations which were determined by searching for the maximum value of 
the Gabor coefficients. We show that it preserves edges better than the Beltrami 
smoothing operator. 

This paper is organized as follows: In section 2 we briefly review the Bayesian 
formulation in the context of image processing. In section 3 we describe the 
geodesic active contours method for intensity images. Next, in section 4 we de- 
scribe the generation of the Gabor feature space. In section 5 we show how to 
apply the geodesic snakes mechanism in the Gabor feature space. In section 6 
we describe the weighted area minimization method, and finally in section 7, we 
provide some preliminary results. 



2 Bayesian Formulation 



The Bayesian approach is useful in finding a compromise between the 
requirements of fidelity to given image data and our a priori knowledge or 
assumptions regarding the nature of "true" images. Accordingly, we consider an image 
to be made of an ensemble of interacting systems, i.e. pixels, wherein the gray 
level of each pixel is a realization of a random process. In other words, the gray 
level of each pixel is drawn from a probability distribution that depends on the 
value of the given noisy image, as well as on a priori information reflecting 
assumptions about the structure and properties of natural images. For example, 
and in particular, the smoothness assumption can be interpreted as the "mean 
free path" of interactions among the above-mentioned pixel generating systems, 
resulting in some kind of a weighted averaging in a neighborhood of the pixel. 
The likelihood of an image, given the noisy data set of an image, is obtained by 
multiplication of the likelihood functions of all the pixels' gray levels. Given a 
pixel at the coordinates (x_i, y_j), according to Bayes' rule 



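$$P_{x_i y_j}\big(I(x_i,y_j)\mid I_0(x_i,y_j)\big) = \frac{P_{x_i y_j}\big(I_0(x_i,y_j)\mid I(x_i,y_j)\big)\,P_{x_i y_j}\big(I(x_i,y_j)\big)}{P_{x_i y_j}\big(I_0(x_i,y_j)\big)} \tag{1}$$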



and 

$$P(I\mid I_0) = \prod_{i,j\,\in\,N\times N} P_{x_i y_j}\big(I(x_i,y_j)\mid I_0(x_i,y_j)\big), \tag{2}$$

where N is the size of the image, and on the left-hand side of both (1) and (2) we 
have the posterior probability distribution of either a pixel value (eq. 1) or of 
the entire image (eq. 2) that we wish to compute, namely the probability of the 
gray value I(x_i, y_j) (or of I), given the data I_0(x_i, y_j) (or I_0). This distribution 
is calculated on the right-hand side of both (1) and (2) as the probability 
of measuring I_0(x_i, y_j) (or I_0), given the "true" image, multiplied by the 
probability of I(x_i, y_j) (or I) being the true image. In other words, this second term 
reflects our prior assumption on the distribution of I(x, y). The denominator 
depends only on I_0(x_i, y_j) and therefore does not affect the optimization process 
of I(x, y). 

One often assumes a Gibbsian distribution, in which case the conditional 
probability becomes 

$$P_{xy}(A\mid B) = \exp\big(-\alpha\, e(A,B)\big),$$

where e(A, B) is an "energy density". Given this type of conditional probability, 
equation (2) becomes 



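$$P(I\mid I_0) = \prod_{i,j\,\in\,N\times N} \frac{P_{x_i y_j}\big(I_0(x_i,y_j)\mid I(x_i,y_j)\big)\,P_{x_i y_j}\big(I(x_i,y_j)\big)}{P_{x_i y_j}\big(I_0(x_i,y_j)\big)} = \exp\Big(-\alpha\int \big(e(I,I_0) - e(I_0)\big)\,dx\,dy\Big). \tag{3}$$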



Determining the image that maximizes the posterior probability is 
equivalent to selecting the image that minimizes the energy. 
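
This equivalence follows in one step from the Gibbsian form of (3). The short 
derivation below is a sketch that uses only the symbols of (3) and assumes α > 0: 

```latex
\begin{align*}
\hat{I} &= \arg\max_{I} P(I \mid I_0)
         = \arg\max_{I} \exp\Big(-\alpha \int \big(e(I,I_0) - e(I_0)\big)\,dx\,dy\Big) \\
        &= \arg\min_{I} \int \big(e(I,I_0) - e(I_0)\big)\,dx\,dy
         = \arg\min_{I} \int e(I,I_0)\,dx\,dy .
\end{align*}
```

The second line uses the monotonicity of the exponential, and the last equality 
holds because e(I_0) does not depend on I. 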

Our study generalizes this framework of the statistical approach to images by 
considering the probability distribution of texture features and not only (and in 
the examples given herewith not at all) of the pixels' gray levels. We also choose 
a somewhat non-standard fidelity term and smoothing term. A special form is 
assumed such that the two terms collapse into one. The technique is borrowed from 
recent results in stereo reconstruction models [6,12] and our prior assumption is 
that textures (and/or other image features) are piecewise uniform. 



3 Geodesic Active Contours 

In this section we review the geodesic active contours method for non-textured 
images [2]. The generalization of the technique to texture segmentation is 
described in section 5. 

Let C(q) : [0, 1] → ℝ² be a parametrized curve, and let I : [0, a] × [0, b] → 
ℝ⁺ be the given image. Let E(r) : [0, ∞[ → ℝ⁺ be an inverse edge detector, so 
that E approaches zero when r approaches infinity. Visually, E should represent 
the edges in the image. Minimizing the energy functional proposed in the 
classical snakes is generalized to finding a geodesic curve in a Riemannian space by 
minimizing: 

$$L_R = \int E\big(|\nabla I(C(q))|\big)\,|C'(q)|\,dq. \tag{4}$$

We may see this term as a weighted length of a curve, where the Euclidean 
length element is weighted by E(|∇I(C(q))|). The latter contains information 
regarding the boundaries within the image. The resultant evolution equation is 
the gradient descent flow: 

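$$\frac{\partial C(t)}{\partial t} = E(|\nabla I|)\,\kappa\,\mathbf{N} - (\nabla E\cdot\mathbf{N})\,\mathbf{N}, \tag{5}$$

where κ denotes the curvature and N the unit normal of the curve. 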
If we now define a function U, so that C = {(x, y) | U(x, y) = 0}, we may use 
the Osher-Sethian level-sets approach [20] and replace the evolution equation 
for the curve C with an evolution equation for the embedding function U: 



$$\frac{\partial U(t)}{\partial t} = |\nabla U|\,\mathrm{Div}\left(E(|\nabla I|)\,\frac{\nabla U}{|\nabla U|}\right). \tag{6}$$



A popular choice for the stopping function E(|∇I|) is given by: 

$$E(|\nabla I|) = \frac{1}{1+|\nabla I|^2},$$

however, other image-specific functions may be used. 
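
As an illustration of (6) with this stopping function, the following is a minimal 
sketch, not the implementation used in this study. It assumes an explicit Euler 
time step with NumPy finite differences; the function names, the step size dt, the 
regularising constant eps and the toy image are all hypothetical. Practical 
level-set codes typically add upwind differencing and periodic reinitialisation of U. 

```python
import numpy as np

def stopping_function(image):
    """E(|grad I|) = 1 / (1 + |grad I|^2), gradients by finite differences."""
    Iy, Ix = np.gradient(image.astype(float))   # derivatives along rows (y) and columns (x)
    return 1.0 / (1.0 + Ix**2 + Iy**2)

def level_set_step(U, E, dt=0.1, eps=1e-8):
    """One explicit Euler step of U_t = |grad U| Div(E grad U / |grad U|)."""
    Uy, Ux = np.gradient(U)
    norm = np.sqrt(Ux**2 + Uy**2) + eps         # eps avoids division by zero where U is flat
    Fx, Fy = E * Ux / norm, E * Uy / norm       # E-weighted unit normal field
    div = np.gradient(Fx, axis=1) + np.gradient(Fy, axis=0)
    return U + dt * norm * div

# Toy example: a bright square on a dark background, contour initialised as a circle.
img = np.zeros((64, 64))
img[20:44, 20:44] = 10.0
E = stopping_function(img)
yy, xx = np.mgrid[0:64, 0:64]
U = np.sqrt((xx - 32.0)**2 + (yy - 32.0)**2) - 28.0   # signed distance to a circle
for _ in range(50):
    U = level_set_step(U, E)
```

E equals one in flat regions and drops towards zero across the edges of the square, 
which is what slows the evolving zero level set near object boundaries. 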



4 Feature Space and Gabor Transform 

The Gabor scheme and Gabor filters have been studied by numerous researchers 
in the context of image representation, texture segmentation and image retrieval. 
A Gabor filter centered at the 2D frequency coordinates (U, V) has the general 
form of: 

$$h(x,y) = g(x',y')\exp\big(2\pi i(Ux+Vy)\big) \tag{7}$$

where 

$$(x',y') = (x\cos\phi + y\sin\phi,\; -x\sin\phi + y\cos\phi), \tag{8}$$

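$$g(x,y) = \frac{1}{2\pi\lambda\sigma^2}\exp\left(-\frac{(x/\lambda)^2 + y^2}{2\sigma^2}\right), \tag{9}$$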

λ is the aspect ratio between x and y scales, σ is the scale parameter, and the 
major axis of the Gaussian is oriented at angle φ relative to the x-axis and to 
the modulating sinewave gratings. 

Accordingly, the Fourier transform of the Gabor function is: 

$$H(u,v) = \exp\Big(-2\pi^2\sigma^2\big((u'-U')^2\lambda^2 + (v'-V')^2\big)\Big) \tag{10}$$

where (u', v') and (U', V') are rotated frequency coordinates. Thus, H(u, v) is 
a bandpass Gaussian with its minor axis oriented at angle φ from the u-axis, 
and the radial center frequency F is defined by F = √(U² + V²), with orientation 
θ = arctan(V/U). Since maximal resolution in orientation is desirable, the filters 
whose sinewave gratings are cooriented with the major axis of the modulating 
Gaussian are usually considered (φ = θ and λ > 1), and the Gabor filter is 
reduced to: h(x, y) = g(x', y') exp(2πiFx'). 

It is possible to generate Gabor-Morlet wavelets from a single mother-Gabor- 
wavelet by transformations such as: translations, rotations and dilations. We can 
generate, in this way, a set of filters for a known number of scales, S, and orienta- 
tions K. We obtain the following filters for a discrete subset of transformations: 

h_mn(x, y) = a^(−m) h(x'/a^m, y'/a^m), where (x', y') are the spatial coordinates 
rotated by πn/K and m = 0...S − 1. Alternatively, one can obtain Gabor wavelets by 
logarithmically distorting the frequency axis [18] or by incorporating multiwindows 
[24]. In the latter case one obtains a more general scheme wherein subsets of the 
functions constitute either wavelet sets or Gaborian sets. 

The feature space of an image is obtained by the inner product of this set of 
Gabor filters with the image: 

$$W_{mn}(x,y) = R_{mn}(x,y) + iJ_{mn}(x,y) = I(x,y) * h_{mn}(x,y). \tag{11}$$
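
The construction of the feature space can be sketched as follows for a single 
scale and K orientations, together with the per-pixel maximum criterion used 
later in the text. This is an illustrative sketch under stated assumptions: the 
co-oriented filter h(x, y) = g(x', y') exp(2πiFx'), a Gaussian normalised by 
1/(2πλσ²), FFT-based convolution, and hypothetical parameter values (σ, λ, F, 
kernel size) and function names. 

```python
import numpy as np

def gabor_kernel(sigma, lam, theta, F, half=15):
    """Co-oriented Gabor filter h(x, y) = g(x', y') exp(2*pi*i*F*x') on a square grid."""
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    xr = x * np.cos(theta) + y * np.sin(theta)      # rotated coordinates, eq. (8)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    g = np.exp(-((xr / lam) ** 2 + yr ** 2) / (2.0 * sigma ** 2))
    g /= 2.0 * np.pi * lam * sigma ** 2             # assumed normalisation of the Gaussian
    return g * np.exp(2j * np.pi * F * xr)

def convolve_same(image, kernel):
    """Linear convolution via zero-padded FFT, cropped to the size of the image."""
    H, W = image.shape
    kh, kw = kernel.shape
    shape = (H + kh - 1, W + kw - 1)
    full = np.fft.ifft2(np.fft.fft2(image, shape) * np.fft.fft2(kernel, shape))
    return full[kh // 2: kh // 2 + H, kw // 2: kw // 2 + W]

def gabor_orientation_features(image, K=8, sigma=4.0, lam=1.5, F=0.125):
    """Return theta_max(x, y), W_max(x, y) and the energy R^2 + J^2 per orientation."""
    thetas = np.pi * np.arange(K) / K
    W = np.stack([convolve_same(image, gabor_kernel(sigma, lam, t, F)) for t in thetas])
    energy = W.real ** 2 + W.imag ** 2              # R^2 + J^2 for each orientation
    k_max = energy.argmax(axis=0)                   # maximum criterion per pixel
    W_max = np.take_along_axis(W, k_max[None], axis=0)[0]
    return thetas[k_max], W_max, energy

# Toy check: a grating varying along x with frequency F should select theta = 0.
yy, xx = np.mgrid[0:64, 0:64]
grating = np.cos(2.0 * np.pi * 0.125 * xx)
theta_max, W_max, energy = gabor_orientation_features(grating)
```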

5 Application of Geodesic Snakes to the Gaborian Feature Space of Images 

The proposed approach enables us to use the geodesic snakes mechanism in the 
Gabor spatial feature space of images by generalizing the inverse edge indicator 
function, which in turn attracts the evolving curve towards the boundary in 
the classical and geodesic snakes schemes. A special feature of our approach is 
the metric introduced in the Gabor space, and used as the building block for the 
stopping function E in the geodesic active contours scheme. 

Sochen et al. [22] view images and image feature spaces as Riemannian 
manifolds embedded in a higher dimensional space. For example, a gray scale 
image is a 2D Riemannian surface (manifold), with (x, y) as local coordinates, 
embedded in ℝ³ with (X, Y, Z) as local coordinates. The embedding map is 
(X = x, Y = y, Z = I(x, y)), and we write it, by abuse of notation, as (x, y, I). 
When we consider feature spaces of images, e.g. color space, statistical moments 
space, and the Gaborian space, we may view the image-feature information as 
an N-dimensional manifold embedded in an (N + M)-dimensional space, where N 
stands for the number of local parameters needed to index the space of interest 
and M is the number of feature coordinates. For example, we may view the Gabor 
transformed image as a 2D manifold with local coordinates (x, y) embedded in a 
6D feature space. The embedding map is (x, y, θ(x, y), σ(x, y), R(x, y), J(x, y)), 
where R and J are the real and imaginary parts of the Gabor transformed image, 
and θ and σ are the direction and scale for which a maximal response has been 
achieved. Alternatively, we can represent the Gabor transform space as a 4D 
manifold with coordinates (x, y, θ, σ) embedded in the same 6D feature space. 
The embedding map, in this case, is (x, y, θ, σ, R(x, y, θ, σ), J(x, y, θ, σ)). The 
main difference between the two approaches is whether θ and σ are considered 
to be local coordinates or feature coordinates. 

A basic concept in the context of Riemannian manifolds is distance. 
Consider, for example, a two-dimensional manifold Σ with local coordinates 
(σ¹, σ²). Since the local coordinates are curvilinear, the distance is calculated 
using a positive definite symmetric bilinear form called the metric, whose 
components are denoted by g_μν(σ¹, σ²): 

$$ds^2 = g_{\mu\nu}\,d\sigma^{\mu}\,d\sigma^{\nu}, \tag{12}$$

where we used the Einstein summation convention: elements with identical 
superscripts and subscripts are summed over. 
The metric on the image manifold is derived using a procedure known as 
pullback. The manifold's metric is then used for various geometrical flows. We 
briefly review the pullback mechanism. More detailed information can be found 
in [22]. 

Let X : Σ → M be an embedding of Σ in M, where M is a Riemannian 
manifold with a metric h_ij and Σ is another Riemannian manifold. We can use 
the knowledge of the metric on M and the map X to construct the metric on 
Σ. This pullback procedure is as follows: 

$$g_{\mu\nu}(\sigma^1,\sigma^2) = h_{ij}\big(X(\sigma^1,\sigma^2)\big)\,\frac{\partial X^i}{\partial\sigma^{\mu}}\,\frac{\partial X^j}{\partial\sigma^{\nu}}, \tag{13}$$

where we used the Einstein summation convention, i, j = 1, ..., dim(M), and 
σ¹, σ² are the local coordinates on the manifold Σ. 

If we pull back the metric of a 2D image manifold from the Euclidean em- 
bedding space (x,y,I) we get: 



$$\big(g_{\mu\nu}(x,y)\big) = \begin{pmatrix} 1+I_x^2 & I_xI_y \\ I_xI_y & 1+I_y^2 \end{pmatrix}. \tag{14}$$



The determinant of g_μν yields the expression 1 + I_x² + I_y². Thus, we can rewrite 
the expression for the stopping term E in the geodesic snakes mechanism as 
follows: 

$$E(|\nabla I|) = \frac{1}{1+|\nabla I|^2} = \frac{1}{\det(g_{\mu\nu})}.$$
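
For completeness, the two steps behind this identity can be written out. With the 
Euclidean metric h_ij = δ_ij on the embedding space and the map (x, y, I(x, y)), 
the pullback (13) gives the entries of (14), and the determinant then expands as: 

```latex
\begin{align*}
g_{\mu\nu} &= \delta_{\mu\nu} + I_{\mu} I_{\nu}, \qquad \mu,\nu \in \{x, y\}, \\
\det(g_{\mu\nu}) &= (1+I_x^2)(1+I_y^2) - (I_x I_y)^2 \\
                 &= 1 + I_x^2 + I_y^2 + I_x^2 I_y^2 - I_x^2 I_y^2 \\
                 &= 1 + I_x^2 + I_y^2 \;=\; 1 + |\nabla I|^2 .
\end{align*}
```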

We may interpret the Gabor transform of an image as a function assigning a value 
W to each pixel's coordinates, scale and orientation. Next, we find the scale and 
orientation for which the transform attains its maximum amplitude 
at each pixel. Thus, for each pixel, we obtain W_max, the maximum value of the 
transform, and θ_max and σ_max, the orientation and scale that yielded this maximum 
value. This approach results in a 2D manifold (with local coordinates (x, y)) 
embedded in a 6D space (with local coordinates (x, y, R(x, y), J(x, y), θ(x, y), 
σ(x, y))). If we use the pullback mechanism described above we get the following 
metric: 



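$$\big(g_{\mu\nu}\big) = \begin{pmatrix} 1+R_x^2+J_x^2+\sigma_x^2+\theta_x^2 & R_xR_y+J_xJ_y+\sigma_x\sigma_y+\theta_x\theta_y \\ R_xR_y+J_xJ_y+\sigma_x\sigma_y+\theta_x\theta_y & 1+R_y^2+J_y^2+\sigma_y^2+\theta_y^2 \end{pmatrix}. \tag{15}$$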

We use the fact that the determinant of the metric is a positive definite edge 
indicator to determine E as the inverse of the determinant of g_μν. Here g_μν is 
a function of the two spatial variables x and y only; therefore, we obtain an 
evolution of a 2D manifold in a 6D embedding space. 
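
A minimal sketch of this edge indicator is given below. It assumes a Euclidean 
embedding space with all feature channels on a common scale, finite-difference 
derivatives, and θ treated as an ordinary real-valued channel (its cyclic nature 
is ignored); the function name and the toy data are hypothetical. 

```python
import numpy as np

def feature_edge_indicator(features):
    """E = 1 / det(g), with g pulled back from the embedding (x, y, f_1, ..., f_M)."""
    shape = np.shape(features[0])
    g11, g22, g12 = np.ones(shape), np.ones(shape), np.zeros(shape)
    for f in features:
        fy, fx = np.gradient(np.asarray(f, dtype=float))
        g11 += fx * fx          # 1 + sum of squared x-derivatives
        g12 += fx * fy          # sum of mixed products
        g22 += fy * fy          # 1 + sum of squared y-derivatives
    return 1.0 / (g11 * g22 - g12 ** 2)

# Toy example: only the orientation channel has a step edge at column 16.
R = np.zeros((32, 32))
J = np.zeros((32, 32))
sigma = np.zeros((32, 32))
theta = np.zeros((32, 32))
theta[:, 16:] = 1.0
E = feature_edge_indicator([R, J, sigma, theta])
```

The determinant is never smaller than one, so E stays in (0, 1] and is smallest 
where the feature channels change most rapidly. 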



6 Smoothing of the Orientation Data by Application 
of the Weighted Area Minimization Method 

In the previous section we have described how the Gabor feature space can 
be treated as a 2D manifold embedded in a 6D space. We have used a maximum 
criterion to obtain a single orientation and scale for each pixel location. However, 
this information does not always represent the textural information well and is 
sensitive to local variations in the texture characteristics. Therefore, the resultant 
orientation data can be quite noisy. Also, some random noise can deteriorate the 
resultant data. Our aim is to reduce the amount of noise in the orientation data 
and obtain a smoother function to be used in the geodesic snakes mechanism. 

We obtain the Gabor feature coefficients as a function of x, y, θ(x, y) and 
σ(x, y). This discussion is devoted to the manipulation of θ; therefore we select 
a single scale σ and generate a set of Gabor filters for that scale, which differ 
in their orientation. Thus, the generated Gabor feature space is a function of 
x, y, θ(x, y). Our aim is to reduce the amount of noise in θ, whether its source is 
a heterogeneous texture or some random noise. We define an energy functional 
which minimizes the magnitude of the Gabor coefficients function weighted by 
an area element determined by x, y, θ(x, y). 



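$$S(\theta) = \int D(x,y,\theta)\,\sqrt{g(\theta_x,\theta_y)}\,dx\,dy \tag{16}$$

where 

$$D(x,y,\theta) = \frac{1}{c + R^2(x,y,\theta) + J^2(x,y,\theta)}$$

is a data fidelity term and g is the determinant of 

$$\big(g_{\mu\nu}\big) = \begin{pmatrix} 1+\theta_x^2 & \theta_x\theta_y \\ \theta_x\theta_y & 1+\theta_y^2 \end{pmatrix}. \tag{17}$$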



The combination √g dx dy, an area element of the orientation manifold (x, y, 
θ(x, y)), is the term that forces smoothing, as the orientation field reduces its 
overall area when it flows towards the optimal solution. For a trivial data term 
the gradient descent process is the Beltrami flow, which ignores any data edges 
that are not already very pronounced in the initial noisy guess. On the other 
hand, a trivial metric for the orientation manifold results in decoupling of the 
different orientation values in different locations, as the metric is the only place 
where derivatives of θ may appear. This decoupling leads to a simple solution: 
at each pixel the orientation for which maximum response is achieved is chosen. 
As we have noted above, this leads to a noisy solution that may undermine the 
correctness of the segmentation process. 

The constant c in the denominator has two roles. The first role is merely 
numeric: to avoid division by zero. The second role has a geometrical meaning, 
since this constant determines the convergence properties of our scheme. If this 
constant is very small, the evolution depends more on the values of the Gabor 
transform, (R² + J²), and the smoothing of θ is less dominant. If the constant is 
very large compared with the values of the Gabor transform, then they are less 
dominant in the evolution, and the smoothing of θ is the same as in the Beltrami 
scheme. 
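
The two regimes can be made explicit. Taking the data term to be 
D = 1/(c + R² + J²), which is how we read "the constant c in the denominator" 
(an assumption), the limits are: 

```latex
\begin{align*}
c \gg R^2 + J^2 :&\quad D \approx \frac{1}{c}
   \;\Longrightarrow\; S(\theta) \approx \frac{1}{c}\int \sqrt{g}\,dx\,dy , \\
c \ll R^2 + J^2 :&\quad D \approx \frac{1}{R^2 + J^2} .
\end{align*}
```

In the first case the functional is a pure area functional, so its descent is the 
Beltrami flow up to a rescaling of time; in the second the Gabor response alone 
sets the weight of the area element. 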

Considering the Bayesian formulation, we notice that we may rewrite the 
energy density as e = (D − const) √g + const √g. We note that the first term is 
a fidelity term that forces θ to align according to the orientation in the noisy 
original image, while the second term pushes towards a minimal surface solution. 
Note that the √g in the first term means that the fidelity term is to be thought 
of as a function on the orientation manifold. Choosing the same constant for 
both terms leads to the functional S written above. Note that we do indeed 
generalize the formalism by considering features, i.e. orientations in this specific 
implementation, rather than intensity. Our assumption is that images are 
piecewise continuous with respect to all the relevant image features/attributes. In the 
case of textureless images, i.e. gray level only, this continuity becomes identical 
to smoothness. 

Thus, we process the manifold θ(x, y), while obtaining the maximum value for 
the Gabor coefficients, so that the contribution and impact of each component 
lead to a satisfactory result. 

Using the Euler-Lagrange method we obtain the following equation: 

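$$\frac{\delta S}{\delta\theta} = -\mathrm{Div}\left(\frac{\nabla\theta}{\big(c + (R^2+J^2)(x,y,\theta(x,y))\big)\sqrt{g}}\right) - \frac{(R^2+J^2)_{\theta}(x,y,\theta(x,y))\,\sqrt{g}}{\big(c + (R^2+J^2)(x,y,\theta(x,y))\big)^2} \tag{18}$$

According to the steepest descent method the evolution equation for θ is: 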



(19) 
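
The descent can be sketched numerically as follows. This is an illustration 
under several assumptions, not the scheme used for the results below: the data 
term is taken as D = 1/(c + R² + J²), the flow is plain gradient descent 
θ_t = −δS/δθ (the evolution actually used may differ by a positive factor), the 
energy R² + J² is sampled at K orientations θ_k = πk/K and interpolated linearly 
in θ, and forward/backward differences with zero-flux boundaries are used. All 
function names and parameter values are hypothetical. 

```python
import numpy as np

def grad_forward(u):
    """Forward differences with zero flux across the last row and column."""
    ux, uy = np.zeros_like(u), np.zeros_like(u)
    ux[:, :-1] = u[:, 1:] - u[:, :-1]
    uy[:-1, :] = u[1:, :] - u[:-1, :]
    return ux, uy

def div_backward(px, py):
    """Backward-difference divergence, the negative adjoint of grad_forward."""
    d = np.zeros_like(px)
    d[:, 0] += px[:, 0]
    d[:, 1:] += px[:, 1:] - px[:, :-1]
    d[0, :] += py[0, :]
    d[1:, :] += py[1:, :] - py[:-1, :]
    return d

def sample_energy(energy, theta):
    """Interpolate energy[k] (sampled at theta_k = pi*k/K) at theta(x, y).
    Returns the interpolated energy and its slope with respect to theta."""
    K = energy.shape[0]
    step = np.pi / K
    pos = (theta % np.pi) / step
    k0 = np.floor(pos).astype(int) % K
    k1 = (k0 + 1) % K
    w = pos - np.floor(pos)
    e0 = np.take_along_axis(energy, k0[None], axis=0)[0]
    e1 = np.take_along_axis(energy, k1[None], axis=0)[0]
    return (1.0 - w) * e0 + w * e1, (e1 - e0) / step

def weighted_area_step(theta, energy, c, dt=0.1):
    """One explicit step of theta_t = Div(D grad(theta) / sqrt(g)) - D_theta sqrt(g)."""
    A, A_theta = sample_energy(energy, theta)
    tx, ty = grad_forward(theta)
    sqrt_g = np.sqrt(1.0 + tx ** 2 + ty ** 2)
    D = 1.0 / (c + A)
    D_theta = -A_theta / (c + A) ** 2
    return theta + dt * (div_backward(D * tx / sqrt_g, D * ty / sqrt_g) - D_theta * sqrt_g)

# Toy check with a trivial data term (zero energy), where the step is pure area minimization.
rng = np.random.default_rng(0)
theta0 = 0.5 + 0.1 * rng.standard_normal((32, 32))
flat_energy = np.zeros((8, 32, 32))
theta = theta0.copy()
for _ in range(20):
    theta = weighted_area_step(theta, flat_energy, c=1.0)
```

With the explicit step, stability requires dt to be small relative to c, since the 
diffusion coefficient D/√g is bounded by 1/c. 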



7 Results and Discussion 

The Beltrami flow, being a nonlinear diffusion scheme, offers advantages in 
processing and analysis of images compared with linear diffusion. In the context of 
the present study it preserves edges more accurately. We use an approach similar 
to the Beltrami flow, where the Gaborian orientation data, θ, is treated as a 2D 
manifold embedded in 3D space, (x, y, θ). Our aim is to smooth the orientation 
information while accounting for the maximal Gabor coefficients obtained. 
Following the minimal weighted area diffusion, we can use θ as the input to the 
geodesic snakes algorithm. Geodesic snakes is an efficient geometric flow scheme 
for boundary detection, where the initial conditions include an arbitrary function 
U which implicitly represents the curve, and a stopping term E which contains 
the information regarding the boundaries in the image. We generalize the 
definition of gradients, usually considered in the context of intensity gradients over 
(x, y), to other possible gradients in scale and orientation. This gradient 
information is the input function E to the newly generalized geodesic snakes flow. 

Next, we present the results of the minimal weighted area method compared 
to the Beltrami scheme. [For the complete set of full size images and a demo see 
the web-page: http://www-visl.technion.ac.il/emmcvpr2001]. 

In this study we have generated the Gabor wavelets for eight orientations, 
the scale being kept constant. In the geodesic snakes mechanism U was initiated 
to be a signed distance function [2]. 
Fig. 1. An image of textures taken from the Brodatz album of textures [1]. The circular 
object is generated from the background texture after rotation by 30 degrees. 



The first image (Fig. 1) is taken from the Brodatz album of textures [1]. 
The circular object is generated from the background texture by rotating it 
by 30 degrees. We apply the Gabor transform to this image and obtain the 
maximal values of the Gabor coefficients per pixel, and the orientation for which 
the maximal values were obtained. In figure (2(a)) we see that the orientation 
information is a piece-wise constant function and that it clearly captures the 
boundary between object and background. In figure (2(b)) we see the orientation 
information after random noise was added to it. When the Beltrami flow is 
applied to the noisy orientation image, if the edges are to be better preserved, 
we should compromise on the degree of smoothing of the background, as can 
be seen in figure (2(c)). If further smoothing is desired, the edges are smeared 
(Fig. 2(d)). When the Gabor coefficients are accounted for, we obtain a high 
degree of smoothing while preserving the sharpness of the edges (Fig. 2(e)). 

The inter-relations between the Beltrami flow and the weight of the Gabor 
coefficients can be seen in the next example. By changing the constant value in 
the denominator from values larger than the mean value of R² + J² (equivalent 
to the Beltrami flow) to values smaller than the mean value, we control the impact 
of the Beltrami numerator relative to the Gabor denominator. 

The second image is similar to the first one; however, here the rotation is 
by 45 degrees (Fig. 3). In figure (4(a)) we see the relevant orientation data 
after application of the Gabor transform and obtaining the maximal values of 
the Gabor coefficients per pixel. As we did before, we add random noise to the 
orientation information (4(b)), and apply the smoothing procedures to remove it. 
Next, we evaluate the effect of changing the constant value. When the constant 
is big in comparison to the average value of R² + J², the weighted minimal 
area method is equivalent to the Beltrami flow (Fig. 4(c)). As we decrease this 
constant the impact of the Gabor coefficients is more evident and we obtain 
the same degree of smoothing without damaging the edges (Fig. 4(d,e)). This 
is because the weighting of the Gabor coefficients in the Beltrami flow tends 
to keep the edges better than when applying the original Beltrami flow, where 
the only constraint is on the smoothing of the θ manifold. However, when the 
constant value is in the range of R² + J², the Gabor coefficients are very dominant 
compared to the smoothing of the θ manifold, and the evolution of θ can be 
led to local minima. This is manifested in the white dots that appear when the 
constant is equal to 200 (Fig. 4(f)). 

Fig. 2. a. The original orientation information following the application of the Gabor 
transform and using the maximum criterion (top left). b. The orientation information 
following addition of random noise (top right). c. Results of the Beltrami diffusion 
when the process is halted so that significant edges are still evident (middle left). d. If 
further smoothing is desired, the edges are smeared when the Beltrami diffusion is 
applied (middle right). e. The result obtained following application of the minimal 
weighted area method (bottom). 

Fig. 3. This image is taken from the Brodatz album of textures [1]. The circular object 
is generated from the background texture by rotating it by 45 degrees. 

Fig. 4. a. The original orientation information following the application of the Gabor 
transform and using the maximum criterion (top left). b. The orientation information 
following addition of random noise (top right). The results of application of the weighted 
minimal area method are presented, where the mean value of R² + J² is about 200. 
The difference between the results is the value of the constant c in the denominator: 
c. c = 10,000 (middle left). d. c = 800 (middle right). e. c = 600 (bottom left). f. c = 200 
(bottom right). 

Fig. 5. This image is a synthesized texture composed of a linear combination of spatial 
sinewave gratings of different orientations, to which some random noise was added. 

Fig. 6. The maximal values of the Gabor coefficients per pixel are obtained along with 
the relevant orientation information. a. The magnitude of the Gabor coefficients (left). 
b. The orientation information (right). 

In the next example, we demonstrate how the different smoothing processes 
affect the results of the geodesic snakes mechanism. The original image is a 
synthesized texture composed of linear combination of spatial sinewave gratings 
of different orientations where some random noise was added to it (Fig. 5). 

After application of the Gabor filters the maximal value of the Gabor coef- 
ficients per pixel is calculated (Fig. 6(a)) and the orientation image obtained is 
noisy (Fig. 6(b)). When the Beltrami flow is applied to the noisy orientation im- 
age we obtain a smooth result (Fig. 7(a)). However, the edges are more dominant 
when the Gabor coefficients are accounted for (Fig. 7(b)). 
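The per-pixel maximum over the Gabor responses described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the filter bank output is already available as a complex array `responses` of shape (K, H, W), one slice per orientation, and that `thetas` lists the K filter orientations. Both names, and the function name, are hypothetical.

```python
import numpy as np

def max_gabor_orientation(responses, thetas):
    """Per-pixel maximal Gabor magnitude and the orientation that attains it.

    responses: complex array of shape (K, H, W), one slice per orientation
               (assumed to be precomputed by some Gabor filter bank).
    thetas:    sequence of K orientation angles, in the same order.
    """
    mags = np.abs(responses)                      # magnitude of each coefficient
    idx = np.argmax(mags, axis=0)                 # index of the strongest filter per pixel
    max_mag = np.take_along_axis(mags, idx[None], axis=0)[0]
    orientation = np.asarray(thetas)[idx]         # orientation image (noisy in practice)
    return max_mag, orientation
```

The orientation image returned here corresponds to the noisy input that the smoothing flows are then applied to.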

Next we use the smoothed θ obtained to calculate the stopping term E in the 
geodesic snakes mechanism. The stopping term obtained following the Beltrami 
diffusion is seen in figure (7(c)), and the one obtained following the minimal 
weighted area method can be seen in figure (7(d)). The resultant boundaries 
obtained can be seen in figure (7(e+f)). The most evident difference between 
the two results can be seen on the top right hand side of the boundaries. It is 
clear that using the minimal weighted area method the edges are better captured 
and detected. 

Currently we are extending this study to the other Gaborian features, such as the 
scale parameter a and the sine grating frequency F. We explore the behavior of 
each parameter when considered separately, and also the coupling between these 
parameters. Another natural continuation of this work is to apply the results of 
Kimmel and Sochen [14] in order to obtain a more robust orientation diffusion 
where the orientations manifold is embedded in 0 5^. By properly choosing 
the local coordinate systems for both manifolds the problem arising from the 
cyclic nature of angles is addressed. 
Abstract. A number of geometric active contour and surface models 
have been proposed for shape segmentation in the literature. The es- 
sential idea is to evolve a curve (in 2D) or a surface (in 3D) so that it 
clings to the features of interest in an intensity image. Several of these 
models have been derived, using a variational formulation, as gradient 
flows which minimize or maximize a particular energy functional. How- 
ever, in practice these models often fail on images of low contrast or 
narrow structures. To address this problem we have recently proposed 
the idea of maximizing the rate of increase of flux of an auxiliary vector 
field through a curve. This has led to an interpretation as a 2D gradient 
flow, which is essentially parameter free. In this paper we extend 
the analysis to 3D and prove that the form of the gradient flow does not 
change. We illustrate its potential with level-set based segmentations of 
blood vessels in a large 3D computed rotational angiography (CRA) data 
set. 



1 Introduction 

Level-set based numerical methods for hyperbolic conservation laws developed 
by Osher and Sethian [14] for curvature-dependent flame propagation were intro- 
duced to the computer vision community for shape analysis by Kimia et al. [8]. 
Such models were later adapted to the problem of shape segmentation indepen- 
dently by Caselles et al. [3] and Malladi et al. [13]. Here the essential idea was 
to halt an evolving curve in the presence of intensity edges by multiplying the 
evolution equation with an image-gradient based stopping potential. This led to 
new active contour models which, when implemented using level set methods, 
handled changes in topology due to the splitting and merging of multiple con- 
tours in a natural way. These geometric flows for shape segmentation were later 
given formal motivation as well as unified with the classical energy minimization 
formulations through several independent investigations [4,7,16,17]. The main 
idea was to modify the Euclidean arc-length or the Euclidean area by a scalar 
function and to then derive the resulting gradient evolution equations. Mathe- 
matically this amounted to defining a new metric on the plane, tailored to the 
given image, and then deriving the corresponding gradient flows. The results 
generalized to the case of evolving surfaces in 3D by adding one more dimension 
to the variational formulation. 

Recently there have been other advances in the use of geometric flows in 
computer vision, which have both theoretical and practical value. First, it has 
been recognized that a practical weakness of most geometric flows with stopping 
terms based purely on local image gradients is that they may “leak” in the 
presence of weak or low contrast boundaries, are not suitable for segmenting 
textures and typically require the initial curve or curves to lie entirely inside or 
outside the regions to be segmented. Thus, a number of researchers have sought 
to derive flows which take into account the statistics of the regions enclosed by 
the evolving curves [15,21]. Further developments include multi-phase motions, 
which allow triple points to be captured [5], as well as the incorporation of an 
external force field based on a diffused gradient of an edge map [20]. Second, 
most geometric flows are not able to capture elongated low contrast structures 
well, such as blood vessels viewed in 2D and 3D angiography images. At places 
where such structures are narrow, edge gradients may be weak due to partial 
volume effects and it is also unclear how to robustly measure region statistics. 
Approaches to regularizing the flow in 3D by introducing a term proportional 
to mean curvature have the unfortunate effect of annihilating such structures. 
To address this issue, Lorigo et al. have proposed the use of active contours 
with co-dimension 2 (curves in 3D) [12]. The idea is to regularize the flow by 
a term proportional to the curvature of a 3D curve. The approach is grounded 
in the level set theory for mean curvature evolution of surfaces of arbitrary co- 
dimension [1] and has a variational formulation along with an energy minimizing 
interpretation. However, the derived flow is later modified with a (heuristic) 
multiplicative term to tailor it to blood vessel segmentation [12]. 

We have recently suggested an alternate approach to segmenting blood vessels 
in angiography images, which is motivated by the observation that blood flows 
in the direction of vessels. Brightness in angiography images is proportional to 
the magnitude of the blood flow velocity. This leads to the constraint that in 
the vicinity of blood vessel boundaries, the gradient vector field of the image 
should be locally orthogonal to them. Thus, a natural principle to use towards 
the recovery of these boundaries is to maximize the inward flux of the gradient 
vector field through an evolving curve (in 2D) or surface (in 3D). The derivation 
of the 2D flux maximizing flow was presented in [18] and led to an elegant 
interpretation which is essentially parameter free. In the current paper we prove 
that the extension to 3D has the same form, a calculation which is more subtle. 
We also illustrate the potential of the 3D flux maximizing flow with several new 
simulations of blood vessels segmented from a large 3D computed rotational 
angiography (CRA) data set. 



2 3D Flux Maximizing Flows 

Let $S : [0,1] \times [0,1] \to \mathbb{R}^3$ denote a compact embedded surface with (local) 
coordinates $(u, v)$. Let $\mathcal{N}$ be the inward unit normal. We set 

$$S_u := \frac{\partial S}{\partial u}, \qquad S_v := \frac{\partial S}{\partial v}.$$

Then the infinitesimal area on S is given by 



$$dS = \left(\|S_u\|^2 \|S_v\|^2 - \langle S_u, S_v\rangle^2\right)^{1/2} du\,dv = \|S_u \wedge S_v\|\, du\,dv.$$



Let $V = (V_1(x,y,z), V_2(x,y,z), V_3(x,y,z))$ be a vector field defined for each point 
$(x,y,z)$ in $\mathbb{R}^3$. The total inward flux of the vector field through the surface is 
defined by the surface integral 

$$\mathrm{Flux}(t) = \int_0^{A(t)} \langle V, \mathcal{N}\rangle\, dS, \qquad (1)$$

where A{t) is the surface area of the evolving surface. The main contribution of 
the current paper is the proof of the following theorem 

Theorem 1. The direction in which the inward flux of the vector field $V$ through 
the surface $S$ is increasing most rapidly is given by $\frac{\partial S}{\partial t} = \mathrm{div}(V)\,\mathcal{N}$. 

It turns out that the flux maximizing flow has the same form in 2D, a calculation 
which we presented in [18]. However, the proof is more subtle for the 3D case. 

Proof: The essential idea is to calculate the first variation of the flux functional 
with respect to t: 

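$$\mathrm{Flux}'(t) = \underbrace{\int_0^{A(t)} \langle V_t, \mathcal{N}\rangle\, dS}_{I_1} + \underbrace{\int_0^{A(t)} \langle V, \mathcal{N}_t\rangle\, dS}_{I_2}.$$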

With $S = (x(u,v,t), y(u,v,t), z(u,v,t))$, the unit normal vector is given by the 
normalized cross product of two vectors in the tangent plane: 

$$\mathcal{N} = \frac{S_u \wedge S_v}{\|S_u \wedge S_v\|} = \frac{(N_1, N_2, N_3)}{\|S_u \wedge S_v\|} = \frac{(y_u z_v - y_v z_u,\; x_v z_u - x_u z_v,\; x_u y_v - x_v y_u)}{\|(y_u z_v - y_v z_u,\; x_v z_u - x_u z_v,\; x_u y_v - x_v y_u)\|}. \qquad (2)$$



$I_1$ is then given by 

$$I_1 = \int_0^1\!\!\int_0^1 \langle S_t, (N_1 \nabla V_1 + N_2 \nabla V_2 + N_3 \nabla V_3)\rangle\, du\,dv,$$

where the integrand is the inner product of $S_t$ with another vector. We shall now 
simplify $I_2$ so that it takes on a similar form. It turns out to be advantageous to 
express the unit normal vector in Eq. (2) as $\frac{S_u \wedge S_v}{\|S_u \wedge S_v\|}$ and expand it in terms of 
the partial derivatives $x_u, x_v, y_u, y_v, z_u, z_v$ only later. With $dS = \|S_u \wedge S_v\|\,du\,dv$, 
$I_2$ can be rewritten as 

$$I_2 = \int_0^1\!\!\int_0^1 \langle V, (S_u \wedge S_v)_t\rangle\, du\,dv.$$
The trick now is to exploit the fact that for any vectors $A$, $B$ and $C$, the following 
properties of inner products and cross products hold: 

$$A \wedge B = -B \wedge A$$
$$\langle A, (B \wedge C)\rangle = \langle (A \wedge B), C\rangle$$
$$(A \wedge B)_t = (A_t \wedge B) + (A \wedge B_t).$$
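The three identities can be spot-checked numerically. The sketch below is illustrative only: the function name is hypothetical, and the product rule is tested by a central finite difference along the straight-line paths A(t) = A + t·At and B(t) = B + t·Bt, an assumption made purely for this check.

```python
import numpy as np

def identity_residuals(A, B, C, At, Bt, eps=1e-6):
    """Return the absolute residual of each of the three identities."""
    # Anticommutativity: A ^ B + B ^ A = 0
    r1 = np.abs(np.cross(A, B) + np.cross(B, A)).max()
    # Scalar triple product: <A, B ^ C> = <A ^ B, C>
    r2 = abs(A @ np.cross(B, C) - np.cross(A, B) @ C)
    # Product rule: central difference of (A ^ B) versus (At ^ B) + (A ^ Bt)
    fd = (np.cross(A + eps * At, B + eps * Bt)
          - np.cross(A - eps * At, B - eps * Bt)) / (2 * eps)
    r3 = np.abs(fd - (np.cross(At, B) + np.cross(A, Bt))).max()
    return r1, r2, r3
```

For random inputs all three residuals should be at the level of floating-point round-off.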



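Hence, $I_2$ can be written as 

$$
\begin{aligned}
I_2 &= \int_0^1\!\!\int_0^1 \langle V, (S_{ut} \wedge S_v + S_u \wedge S_{vt})\rangle\, du\,dv \\
&= \int_0^1\!\!\int_0^1 \langle V, (S_{ut} \wedge S_v)\rangle\, du\,dv + \int_0^1\!\!\int_0^1 \langle V, (S_u \wedge S_{vt})\rangle\, du\,dv \\
&= -\int_0^1\!\!\int_0^1 \langle V, (S_v \wedge S_{ut})\rangle\, du\,dv + \int_0^1\!\!\int_0^1 \langle V, (S_u \wedge S_{vt})\rangle\, du\,dv \\
&= \int_0^1 \underbrace{\left[\int_0^1 -\langle (V \wedge S_v), S_{ut}\rangle\, du\right]}_{I_3} dv + \int_0^1 \underbrace{\left[\int_0^1 \langle (V \wedge S_u), S_{vt}\rangle\, dv\right]}_{I_4} du.
\end{aligned}
$$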



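Using integration by parts, $I_3$ works out to be 

$$I_3 = \underbrace{-\langle (V \wedge S_v), S_t\rangle \Big]_0^1}_{\text{equals } 0} + \int_0^1 \langle S_t, (V \wedge S_v)_u\rangle\, du.$$

Similarly, using integration by parts, $I_4$ works out to be 

$$I_4 = \underbrace{\langle (V \wedge S_u), S_t\rangle \Big]_0^1}_{\text{equals } 0} - \int_0^1 \langle S_t, (V \wedge S_u)_v\rangle\, dv.$$

Combining $I_3$ and $I_4$, $I_2$ works out to be 

$$I_2 = \int_0^1\!\!\int_0^1 \langle S_t, (V \wedge S_v)_u - (V \wedge S_u)_v\rangle\, du\,dv.$$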

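It can now be seen that the integrand in $I_2$ has the desired form of the inner 
product of $S_t$ with another vector. Hence, combining $I_1$ and $I_2$, the first variation 
of the flux is 

$$\int_0^1\!\!\int_0^1 \langle S_t,\; N_1 \nabla V_1 + N_2 \nabla V_2 + N_3 \nabla V_3 + (V \wedge S_v)_u - (V \wedge S_u)_v\rangle\, du\,dv.$$

Note that 

$$
\begin{aligned}
(V \wedge S_v)_u - (V \wedge S_u)_v &= (V_u \wedge S_v) + (V \wedge S_{vu}) - (V \wedge S_{uv}) - (V_v \wedge S_u) \\
&= (V_u \wedge S_v) - (V_v \wedge S_u).
\end{aligned}
$$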
Hence, the first variation of the flux can be written as the surface integral 

$$\int_0^{A(t)} \left\langle S_t,\; \frac{N_1 \nabla V_1 + N_2 \nabla V_2 + N_3 \nabla V_3 + (V_u \wedge S_v) - (V_v \wedge S_u)}{\|S_u \wedge S_v\|} \right\rangle dS.$$

Thus, for the inward flux to increase as fast as possible, the two vectors should 
be made parallel: 

$$S_t = \frac{N_1 \nabla V_1 + N_2 \nabla V_2 + N_3 \nabla V_3 + (V_u \wedge S_v) - (V_v \wedge S_u)}{\|S_u \wedge S_v\|}. \qquad (3)$$



The above expression for the 3D flux maximizing gradient flow can be further 
simplified by noting that the components of the flow in the tangential plane to 
the surface S affect only the parametrization of the surface, but not its evolved 
shape. Hence, they can be dropped. The normal component of the flow can be 
calculated by taking the inner product of the right hand side of Eq. (3) with the 
unit normal vector in Eq. (2) to give 



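$$S_t = \left\langle \frac{N_1 \nabla V_1 + N_2 \nabla V_2 + N_3 \nabla V_3 + (V_u \wedge S_v) - (V_v \wedge S_u)}{\|S_u \wedge S_v\|},\; \frac{S_u \wedge S_v}{\|S_u \wedge S_v\|} \right\rangle \mathcal{N}.$$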

It is now a straightforward task to expand the terms in the expression by using 
Eq. (2): 



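$$
\begin{aligned}
S_t = \Big[\; & \frac{N_1}{\|S_u \wedge S_v\|^2}\big(V_{1x}(y_u z_v - y_v z_u) + V_{1y}(x_v z_u - x_u z_v) + V_{1z}(x_u y_v - x_v y_u)\big) \\
+\; & \frac{N_2}{\|S_u \wedge S_v\|^2}\big(V_{2x}(y_u z_v - y_v z_u) + V_{2y}(x_v z_u - x_u z_v) + V_{2z}(x_u y_v - x_v y_u)\big) \\
+\; & \frac{N_3}{\|S_u \wedge S_v\|^2}\big(V_{3x}(y_u z_v - y_v z_u) + V_{3y}(x_v z_u - x_u z_v) + V_{3z}(x_u y_v - x_v y_u)\big) \\
+\; & \frac{1}{\|S_u \wedge S_v\|^2}\big\langle (V_u \wedge S_v),\; (y_u z_v - y_v z_u,\; x_v z_u - x_u z_v,\; x_u y_v - x_v y_u)\big\rangle \\
-\; & \frac{1}{\|S_u \wedge S_v\|^2}\big\langle (V_v \wedge S_u),\; (y_u z_v - y_v z_u,\; x_v z_u - x_u z_v,\; x_u y_v - x_v y_u)\big\rangle \Big]\, \mathcal{N}.
\end{aligned}
$$

With 

$$
\begin{aligned}
V_u \wedge S_v = \big(\; & z_v(V_{2x} x_u + V_{2y} y_u + V_{2z} z_u) - y_v(V_{3x} x_u + V_{3y} y_u + V_{3z} z_u), \\
& x_v(V_{3x} x_u + V_{3y} y_u + V_{3z} z_u) - z_v(V_{1x} x_u + V_{1y} y_u + V_{1z} z_u), \\
& y_v(V_{1x} x_u + V_{1y} y_u + V_{1z} z_u) - x_v(V_{2x} x_u + V_{2y} y_u + V_{2z} z_u)\;\big),
\end{aligned}
$$

$$
\begin{aligned}
-(V_v \wedge S_u) = \big(\; & -z_u(V_{2x} x_v + V_{2y} y_v + V_{2z} z_v) + y_u(V_{3x} x_v + V_{3y} y_v + V_{3z} z_v), \\
& -x_u(V_{3x} x_v + V_{3y} y_v + V_{3z} z_v) + z_u(V_{1x} x_v + V_{1y} y_v + V_{1z} z_v), \\
& -y_u(V_{1x} x_v + V_{1y} y_v + V_{1z} z_v) + x_u(V_{2x} x_v + V_{2y} y_v + V_{2z} z_v)\;\big),
\end{aligned}
$$

the terms can be grouped and simplified. The curious result is that most cancel, 
leaving the following simple and elegant form for the 3D flux maximizing flow: 

$$S_t = (V_{1x} + V_{2y} + V_{3z})\,\mathcal{N} = \mathrm{div}(V)\,\mathcal{N}. \quad \square \qquad (4)$$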
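The cancellation behind Eq. (4) can be checked numerically at a single surface point. The sketch below is a sanity check rather than part of the proof. It assumes the vector field is described locally by its Jacobian `J` (with `J[i, j]` the derivative of the i-th component with respect to the j-th coordinate), so that by the chain rule V_u = J S_u and V_v = J S_v. The function name and argument names are illustrative.

```python
import numpy as np

def normal_speed(J, Su, Sv):
    """Normal component of the flow in Eq. (3), evaluated at one surface point.

    J:  3x3 Jacobian of the vector field V (row i is the gradient of V_i).
    Su: tangent vector dS/du.   Sv: tangent vector dS/dv.
    """
    n = np.cross(Su, Sv)                 # (N1, N2, N3), the unnormalized normal
    Vu = J @ Su                          # chain rule: V_u = J S_u
    Vv = J @ Sv                          # chain rule: V_v = J S_v
    # N1*grad(V1) + N2*grad(V2) + N3*grad(V3) equals J^T n
    vec = J.T @ n + np.cross(Vu, Sv) - np.cross(Vv, Su)
    # inner product with the unit normal, including the 1/||Su ^ Sv|| of Eq. (3)
    return (vec @ n) / (n @ n)
```

For any Jacobian and any pair of linearly independent tangent vectors, the returned value should agree with the divergence `np.trace(J)`, independently of the parametrization.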
Remark: The flux maximizing flow is a hyperbolic equation since it depends 
solely on the external vector field V and not on properties of the evolving surface. 
It is easy to see that the flow will drive towards and then converge to a zero level 
set of the divergence of V. Thus, the existence and uniqueness of a solution to 
Eq. (4) is guaranteed, unless the vector field is everywhere non-conservative. 

3 Blood Vessel Segmentation 

3.1 Background 

We shall now show how the 3D flux maximizing flow can be tailored to the prob- 
lem of segmenting blood vessels in angiography images. We begin by reviewing 
some of the recent approaches which have been proposed in the literature. Wil- 
son and Noble have introduced a Gaussian mixture model to characterize the 
physical properties of blood flow [19]. The parameters are estimated using the 
EM algorithm and structural criteria are then used to refine the initial segmen- 
tation. Krissian et al. propose a method which incorporates a Gaussian model 
for the intensity distribution as a function of distance from vessel centerlines, 
and exploits properties of the Hessian to obtain geometric estimates [11]. Koller 
et al. have also introduced a multi-scale method for the detection of curvilinear 
structures in 2D and 3D data [9] which combines the responses of steerable linear 
filters and also exploits the Hessian matrix to obtain geometric estimates. Bullitt 
et al. have introduced a method for obtaining 3D vascular trees which calculates 
vessel centerlines as intensity ridges in the data and estimates vessel width via 
medialness calculations [2]. It should be noted that several of the above ap- 
proaches require second derivative computations, e.g., to compute the Hessian. 
Numerically accurate estimates of principal curvature magnitudes and direc- 
tions are obtained only when the intensity images have been suitably smoothed. 
Approaches to smoothing the data while preserving and enhancing vessel-like 
structures include [10,6]. 

Whereas the potential of several of the above approaches has been empirically 
demonstrated, their ability to recover low contrast thin vessels remains unclear. 
A recent framework which has been developed with this as one of its goals is the 
work of Lorigo et al. [12]. The main idea is to regularize a geometric flow in 3D 
using the curvature of a 3D curve, rather than the classical mean curvature based 
regularizations which tend to annihilate thin structures. The work is grounded 
in the recent level set theory developed for mean curvature flows in arbitrary 
co-dimension [1]. This flow is given by [12]: 

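$$\psi_t = \lambda(\nabla\psi, \nabla^2\psi) + \rho(\nabla\psi \cdot \nabla I)\,\frac{g'}{g}\,\nabla\psi \cdot H \frac{\nabla I}{\|\nabla I\|}.$$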

Here $\psi$ is an embedding surface whose zero level set is the evolving 3D curve, 
$\lambda$ is the smaller nonzero eigenvalue of a particular matrix [1], $g$ is an 
image-dependent weighting factor, $I$ is the intensity image and $H$ is its Hessian. For 
numerical simulations the evolution of the curve is depicted by the evolution of 
an $\epsilon$-level set. It should be noted that without the multiplicative factor $\rho(\nabla\psi \cdot \nabla I)$ 
the evolution equation is a gradient flow which minimizes a weighted curvature 
functional. The multiplicative factor is a heuristic which modifies the flow so 
that normals to the $\epsilon$-level set align themselves (locally) to the direction of 
image intensity gradients (the inner product of $\nabla\psi$ and $\nabla I$ is then maximized). 
However, with the introduction of this term the flow loses its pure energy 
minimizing interpretation. 

3.2 The 3D Flux Maximizing Flow 

The intuition behind using the 3D flux maximizing flow for blood vessel seg- 
mentation is illustrated in Figure 1. Here a cross section through an idealized 
blood vessel (a bright region in a uniform darker background) is depicted. It is 
clear that if one considers the gradient $\nabla I$ of the original intensity image $I$ to 
be the vector field V whose inward flux through the evolving surface is to be 
maximized, then the optimal configuration is for the evolving surface to align 
itself locally to the blood vessel boundaries. However, an important considera- 
tion in the implementation of Eq. (4) is that since the divergence of the vector 
field needs to be calculated, implicitly second derivatives of I are being used. 
The numerical computation can be made much more robust by exploiting a con- 
sequence of the divergence theorem. The divergence at a point is defined as the 
net outward flux per unit volume, as the volume about the point shrinks to zero. 
Via the divergence theorem, 

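$$\int_{\Delta v} \mathrm{div}(V)\, dv = \int_{\Delta s} \langle V, \mathcal{N}\rangle\, dS. \qquad (5)$$

Here $\Delta v$ is the volume, $\Delta s$ is its bounding surface, $\mathcal{N}$ is the outward normal at 
each point on the surface, and $dv$ and $dS$ are volume and surface area elements. 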

For our numerical implementations we use this outward flux formulation 
(which gives a measure proportional to the divergence) along the boundaries of 
spheres of varying radii, corresponding to a range of 3D blood vessel widths. 
The chosen flux value at a particular location is the maximum (magnitude) 
flux over the range of radii. In contrast to other multi-scale approaches where 
combining information across scales is non-trivial [11], normalization across scales 
is straightforward in our case. One simply has to divide by the number of entries 
in the discrete sum that approximates Eq. (5). Locations where the total outward 
flux is negative (or equivalently the total inward flux is positive) correspond to 
sinks; locations where the total outward flux is positive correspond to sources, 
as illustrated in Figure 1. Hence, the inward flux maximizing flow in Eq. (4) has 
the desirable effect that, when seeds are placed within blood vessels, the sinks 
drive the seeds towards the vessel boundaries and the sources outside prevent 
the flow from leaking. 
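The multi-scale flux measure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: sphere boundaries are sampled with a Fibonacci lattice, the gradient field is looked up by nearest-neighbour rounding, and sample positions are clipped to the volume. All function and parameter names are hypothetical.

```python
import numpy as np

def sphere_normals(n_samples=64):
    """Roughly uniform unit vectors on the sphere (Fibonacci lattice)."""
    k = np.arange(n_samples) + 0.5
    z = 1.0 - 2.0 * k / n_samples
    phi = np.pi * (1.0 + 5.0 ** 0.5) * k
    r = np.sqrt(1.0 - z * z)
    return np.stack([r * np.cos(phi), r * np.sin(phi), z], axis=1)

def multiscale_flux(image, center, radii, n_samples=64):
    """Outward flux of V = grad(I) through spheres about `center`.

    For each radius the discrete sum approximating Eq. (5) is divided by the
    number of samples; the value of largest magnitude over all radii is returned.
    """
    # V = gradient of the intensity image, shape (X, Y, Z, 3)
    grad = np.stack(np.gradient(image.astype(float)), axis=-1)
    normals = sphere_normals(n_samples)          # outward unit normals
    center = np.asarray(center, dtype=float)
    upper = np.array(image.shape) - 1
    best = 0.0
    for radius in radii:
        # nearest-neighbour sample positions on the sphere, clipped to the volume
        pts = np.clip(np.rint(center + radius * normals).astype(int), 0, upper)
        V = grad[pts[:, 0], pts[:, 1], pts[:, 2]]
        flux = np.mean(np.sum(V * normals, axis=1))   # normalized outward flux
        if abs(flux) > abs(best):
            best = flux
    return best
```

With this convention, a negative return value marks a sink (positive inward flux, as expected inside a bright vessel) and a positive value marks a source. In practice the gradient field would be computed once for the whole volume rather than per call.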

Fig. 1. An illustration of the gradient vector field of an angiography image along a 
cross-section of a 3D blood vessel. Assuming a uniform background intensity, at the 
scale of the vessel’s width the total inward flux is positive (a sink). Outside the vessel, 
at a smaller scale, the total inward flux is negative (a source). 

3.3 Level Set Implementation 

In order to implement the flow, we use the level set representation for compact 
embedded surfaces, due to Osher and Sethian [14]. Let $S(u,v,t) : [0,1] \times [0,1] \times [0,T) \to \mathbb{R}^3$ 
be a family of compact embedded surfaces evolving according to the 
surface evolution equation 

$$S_t = F\,\mathcal{N},$$

where $F$ is an arbitrary (local) scalar speed function and $\mathcal{N}$ is the inward unit normal 
to the surface. Then it can be shown that if $S(u,v,t)$ is represented by the zero 
level set of a smooth and Lipschitz continuous function $\Psi : \mathbb{R}^3 \times [0,T) \to \mathbb{R}$, the 
evolving hypersurface $\Psi$ satisfies 

$$\Psi_t = F\,\|\nabla\Psi\|.$$

This last equation is solved using a combination of straightforward discretization 
and numerical techniques derived from hyperbolic conservation laws [14]. For 
hyperbolic terms, care must be taken to implement derivatives with upwinding 
in the proper direction. The evolving surface $S$ is then obtained as the zero level 
set of $\Psi$. 
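One explicit time step of the level set equation with first-order upwinding can be sketched as follows. This is a generic scheme in the style of Osher and Sethian, written under assumptions that the text does not specify: a uniform grid spacing `h`, forward Euler time stepping, and periodic boundary handling through `np.roll` (chosen only for brevity). The function name is illustrative.

```python
import numpy as np

def upwind_step(psi, F, dt, h=1.0):
    """One forward-Euler step of psi_t = F * |grad psi| with upwind differences.

    psi: embedding function on a regular grid (any number of dimensions).
    F:   scalar or array of the same shape, e.g. the flux (divergence) measure.
    """
    grad_plus_sq = np.zeros_like(psi, dtype=float)
    grad_minus_sq = np.zeros_like(psi, dtype=float)
    for ax in range(psi.ndim):
        d_fwd = (np.roll(psi, -1, axis=ax) - psi) / h   # forward difference
        d_bwd = (psi - np.roll(psi, 1, axis=ax)) / h    # backward difference
        grad_plus_sq += np.maximum(d_bwd, 0) ** 2 + np.minimum(d_fwd, 0) ** 2
        grad_minus_sq += np.maximum(d_fwd, 0) ** 2 + np.minimum(d_bwd, 0) ** 2
    # Where F > 0 information travels along +grad(psi), so the gradient norm
    # built from grad_minus_sq is used; where F < 0 the other one is used.
    return psi + dt * (np.minimum(F, 0) * np.sqrt(grad_plus_sq)
                       + np.maximum(F, 0) * np.sqrt(grad_minus_sq))
```

The time step must satisfy a CFL-type restriction (roughly `dt * max|F| <= h`), and a production implementation would typically restrict updates to a narrow band around the zero level set.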

4 Examples 

In earlier work we have presented simulations of the flux maximizing flow on 
both 2D (retinal) and 3D (head) MRA data [18]. In this section we present new 
experiments on a 360 x 330 x 420 computed rotational angiography (CRA) data 
set of the head, from which we have selected four distinct regions containing 
vascular networks of varying complexity. All examples were implemented in a 
level-set framework. 

The evolution results are presented in Figures 2, 3, 4 and 5. For each region 
a maximal intensity projection of the data is shown on the top left, followed by 
the evolution of a few 3D spheres. These spheres were placed somewhat sparsely 
in regions of high flux. Notice how the spheres elongate in the direction of blood 
vessels. The main blood vessels, which have the higher inward flux, are the 
first to be captured. This is the expected evolution since it maximizes the rate 
of increase of inward flux through the evolving surface. Put another way, the 
evolution has the intuitive behaviour that it follows the direction of blood flow 




Fig. 2. An illustration of the flux maximizing flow for a portion of a 360 x 330 x 420 
3D CRA image of blood vessels in the head. A maximum-intensity projection of the 
region being viewed is shown on the top left. The other images depict the evolution of 
a few isolated spheres. Notice how the evolution follows the direction of blood flow to 
reconstruct the blood vessel boundaries. 



to reconstruct the blood vessel boundaries. Our own experience with developing 
and implementing some of the related geometric flows in the literature is that 
many would fail in low contrast regions or would not be able to capture the 
thinner vessels. 











Fig. 3. An illustration of the flux maximizing flow for a portion of a 360 x 330 x 420 
3D CRA image of blood vessels in the head. A maximum-intensity projection of the 
region being viewed is shown on the top left. The other images depict the evolution of 
a few isolated spheres. Notice how the evolution follows the direction of blood flow to 
reconstruct the blood vessel boundaries. 



We should also point out that although CRA data is of higher resolution 
than MRA, the vessel structures exhibit a wider range of intensities and there 
are also a number of other structures whose intensities overlap with those of the 
thin vessels. Thus, simple thresholding of the intensity data generally gives poor 
Fig. 4. An illustration of the flux maximizing flow for a portion of a 360 x 330 x 420 
3D CRA image of blood vessels in the head. A maximum-intensity projection of the 
region being viewed is shown on the top left. The other images depict the evolution of 
a few isolated spheres. Notice how the evolution follows the direction of blood flow to 
reconstruct the blood vessel boundaries. 



results, although this is a commonly used initialization step in many algorithms 
including the approach of [12]. This point is illustrated in Figure 6. The first row 
shows the results of a high threshold on the four regions, where as one would 
expect, many thin low contrast vessels are not captured. As the threshold is 
decreased, more thin vessels are captured, but also many voxels are incorrectly 
labeled as vessels (Figure 6, second row). The segmentation results obtained by 
Fig. 5. An illustration of the flux maximizing flow for a portion of a 360 x 330 x 420 
3D CRA image of blood vessels in the head. A maximum-intensity projection of the 
region being viewed is shown on the top left. The other images depict the evolution of 
a few isolated spheres. Notice how the evolution follows the direction of blood flow to 
reconstruct the blood vessel boundaries. 



the flux maximizing flow are repeated in the last row. The arrows point to some 
of the thin low contrast vessels that are successively captured, but are not seen 
even in the low threshold case. 

5 Conclusion 

In recent work we proposed the flux maximizing flow and derived its form for the 
case of closed curves evolving in the plane [18]. We also suggested that its form 
remains the same in higher dimensions. The main contribution of the current 
paper is the formal derivation of the flux maximizing flow in 3D. We have also 
Fig. 6. A comparison of the segmentation results obtained by the flux maximizing flow 
with simple thresholding on the four different regions of the CRA image. First Row: 
A conservative high threshold fails to capture many thin low contrast vessels. Second 
Row: A lower threshold captures some of the thinner vessels but also incorrectly labels 
many voxels. Third Row: The segmentation results obtained by the flux maximizing 
flow. The arrows point to some of the thin low contrast vessels that are successively 
captured, but are not seen even in the low threshold case. 



carried out a number of new simulations on a large CRA data set. These have 
the intuitive behaviour that the evolution follows the direction of blood flow 
to reconstruct blood vessel boundaries. The results suggest the potential of the 



3D Flux Maximizing Flows 649 



approach to capture low contrast thin vessels, and also illustrate the advantages 
of the method over thresholding the original intensity image, which is a common 
initialization step in many vessel reconstruction algorithms. 

More work remains to be done to validate this technique against ground truth 
or expert segmentations, and we are beginning to do this in collaboration with 
our colleagues in medical imaging. It would also be interesting to see whether 
a regularization term such as that used in [12] could be incorporated in the 
derivation of the flux maximizing flow from first principles. 

Acknowledgements This work was supported by grants from the Canadian Foundation 
for Innovation, FCAR Quebec and the Natural Sciences and Engineering Research 
Council of Canada. We are grateful to Vincent Hayward, Terry Peters and David 
Holdsworth for giving us access to the vessel CRA data. 



References 

1. L. Ambrosio and H. M. Soner. Level set approach to mean curvature flow in 
arbitrary codimension. Journal of Differential Geometry, 43:693-737, 1996. 

2. E. Bullitt, S. Aylward, A. Liu, J. Stone, S. K. Mukherjee, C. Coffey, G. Gerig, and 
S. M. Pizer. 3d graph description of the intracerebral vasculature from segmented 
mra and tests of accuracy by comparison with x-ray angiograms. In IPMI’99, pages 
308-321, 1999. 

3. V. Caselles, F. Catte, T. Coll, and F. Dibos. A geometric model for active contours 
in image processing. Numerische Mathematik, 66:1-31, 1993. 

4. V. Caselles, R. Kimmel, and G. Sapiro. Geodesic active contours. In ICCV’95, 
pages 694-699, 1995. 

5. T. Chan and L. Vese. An efficient variational multiphase motion for the mumford- 
shah segmentation model. In Asilomar Conference on Signals and Systems, Octo- 
ber 2000. 

6. A. Frangi, W. Niessen, K. L. Vincken, and M. A. Viergever. Multiscale vessel 
enhancement filtering. In MICCAI’98, pages 130-137, 1998. 

7. S. Kichenassamy, A. Kumar, P. Olver, A. Tannenbaum, and A. Yezzi. Gradient 
flows and geometric active contour models. In ICCV’95, pages 810-815, 1995. 

8. B. B. Kimia, A. Tannenbaum, and S. W. Zucker. Toward a computational theory 
of shape: An overview. Lecture Notes in Computer Science, 427:402-407, 1990. 

9. T. M. Koller, G. Gerig, G. Szekely, and D. Dettwiler. Multiscale detection of 
curvilinear structures in 2-d and 3-d image data. In ICCV’95, pages 864-869, 
1995. 

10. K. Krissian, G. Malandain, and N. Ayache. Directional anisotropic diffusion applied 
to segmentation of vessels in 3d images. In International Conference On Scale Space 
Theories in Computer Vision, pages 345-348, 1997. 

11. K. Krissian, G. Malandain, N. Ayache, R. Vaillant, and Y. Trousset. Model-based 
multiscale detection of 3d vessels. In CVPR’98, pages 722-727, 1998. 

12. L. M. Lorigo, O. Faugeras, E. L. Grimson, R. Keriven, R. Kikinis, A. Nabavi, and 
C.-F. Westin. Codimension-two geodesic active contours for the segmentation of 
tubular structures. In CVPR’2000, volume 1, pages 444-451, 2000. 

13. R. Malladi, J. A. Sethian, and B. C. Vemuri. Shape modeling with front propaga- 
tion: A level set approach. IEEE Transactions on Pattern Analysis and Machine 
Intelligence, 17(2):158-175, February 1995. 




650 



Kaleem Siddiqi and Alexander Vasilevskiy 



14. S. J. Osher and J. A. Sethian. Fronts propagating with curvature dependent 
speed: Algorithms based on hamilton-jacobi formulations. Journal of Computa- 
tional Physics, 79:12-49, 1988. 

15. N. Paragios and R. Deriche. Geodesic active regions for supervised texture seg- 
mentation. In ICCV’99, pages 926-932, September 1999. 

16. J. Shah. Recovery of shapes by evolution of zero-crossings. Technical report, Dept. 
of Mathematics, Northeastern University, Boston, MA, 1995. 

17. K. Siddiqi, Y. B. Lauziere, A. Tannenbaum, and S. W. Zucker. Area and length 
minimizing flows for shape segmentation. IEEE Transactions on Image Processing, 
7(3):433-443, 1998. 

18. A. Vasilevskiy and K. Siddiqi. Flux maximizing geometric flows. In ICCV’2001, 
July 2001. 

19. D. L. Wilson and A. Noble. Segmentation of cerebral vessels and aneurysms from 
MR angiography data. In IPMI’97, pages 423-428, 1997. 

20. C. Xu and J. Prince. Snakes, shapes and gradient vector flow. IEEE Transactions 
on Image Processing, 7(3):359-369, 1998. 

21. A. Yezzi, A. Tsai, and A. Willsky. A statistical approach to snakes for bimodal 
and trimodal imagery. In ICCV’99, pages 898-903, September 1999. 




Author Index 



Abrantes, Arnaldo J. 576 
Aguiar, Pedro M.Q. 34 
Al-Rawi, Mohammed 216 
Arvo, James 528 
Aubert, Gilles 344 
August, Jonas 497 

Bengoetxea, Endika 454 
Ben Hamza, A. 19 
Bicego, Manuele 75 
Blanc-Feraud, Laure 344 
Bloch, Isabelle 454 
Boykov, Yuri 359 
Bruckstein, Alfred M. 185 
Buhmann, Joachim M. 235 

Camion, Vincent 513 
Castellani, Umberto 91 
Cheeseman, Peter C. 105 
Cohen, Laurent D. 560, 592 

Deschamps, Thomas 560 
Dias, Jose M.B. 375 
Dipanda, Albert 480 
Dovier, Agostino 75 

Fieguth, Paul 314 
Fischer, Bernd 235 
Fusiello, Andrea 91 

Geiger, Davi 544 
Gimel’farb, Georgy 169 

Hancock, Edwin R. 251, 438 
Hirani, Anil N. 528 
Huntsberger, Terry 328 

Ikeuchi, Katsushi 298 

Jie, Yang 216 

Kamijo, Shunsuke 298 
Kanade, Takeo 118 
Keuchel, Jens 267 
Kolmogorov, Vladimir 359 
Krim, Hamid 19 



Kubota, Toshiro 328 

Larranaga, Pedro 454 
Lebanon, Guy 185 
Lee, Tai Sing 118 
Lefebure, Martin 592 
Leitao, Jose M.N. 375 
Lu, Juwei 407 

Marques, Jorge S. 63, 576 
Marsden, Jerrold E. 528 
Martin, Jeffrey T. 328 
Marzani, Franck 480 
Massaro, Alessio 469 
Matula, Pavel 608 
Mir, Arnau 134 
Morris, Robin D. 105 
Moura, Jose M.F. 34 
Murino, Vittorio 75, 91 

Pao, Hsing-Kuo 544 
Pelillo, Marcello 423, 469 
Perchant, Aymeric 454 

Rangarajan, Anand 153 
Robles-Kelly, Antonio 251 
Rocha, Jairo 134 

Sagiv, Chen 621 
Sakauchi, Masao 298 
Samson, Christophe 344 
Sanches, Joao M. 63 
Schellewald, Christian 267 
Schnorr, Christoph 267 
Shi, Jianbo 283 
Siddiqi, Kaleem 636 
Smelyanskiy, Vadim N. 105 
Sochen, Nir A. 621 
Svoboda, David 608 
Sziranyi, Tamas 201 

Toth, Zoltan 201 
Toh, Kar-Ann 391, 407 
Torsello, Andrea 438 
Trouve, Alain 50 

Vasilevskiy, Alexander 636 







Wesolkowski, Slawo 314 
Woo, Sanghyuk 480 

Yau, Wei-Yun 407 
Younes, Laurent 513 
Yu, Stella X. 118, 283 
Yu, Yong 50 



Yuille, Alan 3 

Zeevi, Yehoshua Y. 621 
Zerubia, Josiane 344 
Zoller, Thomas 235 
Zucker, Steven W. 497 




